In this paper, we investigate transfer learning in partially observable contextual bandits, where agents have access to limited knowledge from other agents and only partial information about hidden confounders. We first convert the problem into identifying or partially identifying the causal effects between actions and rewards through optimization problems. To solve these optimization problems, we discretize the original functional constraints on unknown distributions into linear constraints, and sample compatible causal models by sequentially solving linear programs, obtaining causal bounds that account for estimation error. Our sampling algorithms enjoy desirable convergence guarantees for suitable sampling distributions. We then show how causal bounds can be used to improve classical bandit algorithms and how they affect the regret with respect to the size of the action set and the function space. Notably, in the task with function approximation, which allows us to handle general context distributions, our method improves the order of dependence on the function space size compared with the previous literature. We formally prove that our causally enhanced algorithms outperform classical bandit algorithms and achieve orders of magnitude faster convergence rates. Finally, we perform simulations that demonstrate the efficiency of our strategy compared with current state-of-the-art methods. This research has the potential to enhance the performance of contextual bandit agents in real-world applications where data is scarce and costly to obtain.
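To illustrate how causal bounds might sharpen a classical bandit algorithm, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes each arm's mean reward is already known to lie in an interval `[lower[a], upper[a]]` (for example, obtained from the linear programs described above), removes arms whose upper bound falls below the best lower bound, and clips the standard UCB1 index at the upper bound. The function names `prune_arms` and `clipped_ucb`, the Bernoulli reward model, and the numerical bounds are all hypothetical choices made for illustration.

```python
import math
import random


def prune_arms(lower, upper):
    """Keep arms whose upper causal bound is not below the best lower bound."""
    best_lower = max(lower)
    return [a for a in range(len(lower)) if upper[a] >= best_lower]


def clipped_ucb(means, lower, upper, horizon, seed=0):
    """Run UCB1 with indices clipped to assumed causal upper bounds.

    means: true Bernoulli reward means, used only to simulate rewards.
    lower, upper: assumed causal bounds on each arm's mean reward.
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    arms = prune_arms(lower, upper)  # arms that cannot be optimal are dropped
    pulls = [0] * len(means)
    totals = [0.0] * len(means)

    def index(a, t):
        # Standard UCB1 index, truncated at the causal upper bound.
        bonus = math.sqrt(2.0 * math.log(t) / pulls[a])
        return min(totals[a] / pulls[a] + bonus, upper[a])

    for t in range(1, horizon + 1):
        untried = [a for a in arms if pulls[a] == 0]
        if untried:
            arm = untried[0]  # pull every surviving arm once first
        else:
            arm = max(arms, key=lambda a: index(a, t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls


if __name__ == "__main__":
    # Hypothetical bounds: arm 0 is ruled out since 0.5 < 0.6.
    print(clipped_ucb([0.3, 0.8, 0.4], [0.1, 0.6, 0.2], [0.5, 0.9, 0.7], 500))
```

In this sketch the bounds help in two ways: pruned arms are never explored, and a clipped index stops the agent from over-exploring an arm whose causal upper bound is already below the value of a competitor.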