Contextual bandit algorithms are essential for solving real-world decision-making problems. In practice, collecting a contextual bandit's feedback from different domains may involve different costs: for example, measuring drug reactions in mice (as a source domain) versus in humans (as a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain under distribution shift remains a major challenge and is largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even when adapting across domains. Empirical results show that our approach outperforms state-of-the-art contextual bandit algorithms on real-world datasets.