We study the offline contextual bandit problem, where the goal is to learn an optimal policy from observational data. Such data typically suffer from two deficiencies: (i) some variables that confound the actions are unobserved, and (ii) some observations in the collected data are missing. Unobserved confounders introduce confounding bias, while missing observations cause both bias and inefficiency. To overcome these challenges and learn the optimal policy from the observed dataset, we propose a new algorithm, Causal-Adjusted Pessimistic (CAP) policy learning, which characterizes the reward function as the solution of a system of integral equations, builds a confidence set, and greedily selects actions with pessimism. Under mild assumptions on the data, we establish an upper bound on the suboptimality of CAP for the offline contextual bandit problem.
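The last step of the algorithm, acting greedily with pessimism over a confidence set, can be illustrated with a minimal sketch. This is not the CAP algorithm itself: the abstract does not specify how the confidence set is constructed from the integral equation system, so here it is assumed to be a small finite list of candidate reward functions. The names `pessimistic_action` and `confidence_set` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# A candidate reward function maps (context, action) to an estimated reward.
RewardFn = Callable[[float, int], float]


def pessimistic_action(
    context: float,
    actions: Sequence[int],
    confidence_set: Sequence[RewardFn],
) -> int:
    """Pick the action with the best worst-case reward over the confidence set."""

    def worst_case(action: int) -> float:
        # Pessimism: score an action by its lowest plausible reward.
        return min(reward(context, action) for reward in confidence_set)

    # Greedy step: maximize the pessimistic (lower-bound) score.
    return max(actions, key=worst_case)


# Toy confidence set (an assumption): two candidate reward functions
# that disagree strongly about action 1 and mildly about action 0.
confidence_set = [
    lambda x, a: 1.0 * x if a == 1 else 0.5,
    lambda x, a: 0.2 * x if a == 1 else 0.4,
]

print(pessimistic_action(1.0, [0, 1], confidence_set))
print(pessimistic_action(3.0, [0, 1], confidence_set))
```

In this toy setting, action 1 is preferred only when even its least favorable candidate reward exceeds the worst-case reward of action 0, which is the sense in which the policy is pessimistic about poorly identified actions.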