We consider the problem of using observational bandit feedback data from multiple heterogeneous data sources to learn a personalized decision policy that robustly generalizes across diverse target settings. To achieve this, we propose a minimax regret optimization objective that ensures uniformly low regret under general mixtures of the source distributions. We develop a policy learning algorithm tailored to this objective, combining doubly robust offline policy evaluation techniques with no-regret learning algorithms for minimax optimization. Our regret analysis shows that this approach attains the minimal worst-case mixture regret up to a moderate rate that vanishes with the total amount of data across all sources. Our analysis, extensions, and experimental results demonstrate the benefits of this approach for learning robust decision policies from multiple data sources.
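The pipeline described above can be illustrated with a small sketch. This is not the paper's algorithm, only a toy instance under illustrative assumptions: a few synthetic logged-bandit sources, a small finite policy class, a doubly robust value estimate per (source, policy) pair, and exponentiated-gradient updates on the mixture weights over sources while the policy player best-responds. All names, the synthetic data generator, and the stand-in outcome model `q_hat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: k heterogeneous sources, each a logged-bandit
# dataset with 2 actions and d-dimensional contexts.
k, n, d = 3, 500, 5

def make_source(shift):
    """One source: contexts x, logged actions a ~ e(a|x), rewards r."""
    x = rng.normal(shift, 1.0, size=(n, d))
    e = 1.0 / (1.0 + np.exp(-x[:, 0]))        # logging propensity for a=1
    a = (rng.random(n) < e).astype(int)
    true_q = np.stack([x[:, 1], shift * x[:, 2]], axis=1)  # E[r | x, a]
    r = true_q[np.arange(n), a] + 0.1 * rng.normal(size=n)
    prop = np.where(a == 1, e, 1 - e)          # propensity of logged action
    return x, a, r, prop, true_q

sources = [make_source(s) for s in (-1.0, 0.0, 1.0)]

# A small finite policy class: threshold rules on each coordinate.
policies = [lambda x, j=j: (x[:, j] > 0).astype(int) for j in range(d)]

def dr_value(src, pi):
    """Doubly robust off-policy value estimate of pi on one source,
    using a noisy stand-in for a fitted outcome model q_hat."""
    x, a, r, prop, true_q = src
    q_hat = true_q + 0.2 * rng.normal(size=true_q.shape)
    pa = pi(x)
    direct = q_hat[np.arange(n), pa]
    correction = (pa == a) / prop * (r - q_hat[np.arange(n), a])
    return float(np.mean(direct + correction))

# Regret matrix: R[i, j] = estimated regret of policy j on source i.
V = np.array([[dr_value(src, pi) for pi in policies] for src in sources])
R = V.max(axis=1, keepdims=True) - V

# No-regret minimax: exponentiated gradient on mixture weights w over
# sources; the policy player best-responds to the weighted regret.
T, eta = 2000, 0.05
w = np.ones(k) / k
counts = np.zeros(len(policies))
for _ in range(T):
    j = int(np.argmin(w @ R))         # best-response policy
    counts[j] += 1
    w = w * np.exp(eta * R[:, j])     # up-weight sources with high regret
    w /= w.sum()

mix = counts / T                      # mixed policy over the class
worst_mixed = float((R @ mix).max())  # worst-case mixture regret
```

The averaged play `mix` approximates the minimax-optimal mixed policy for this finite game, so its worst-case regret across sources approaches the game value as the number of rounds grows.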