Policy learning from historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on the crucial assumption that the future environment in which the learned policy will be deployed is the same as the past environment that generated the data, an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from incomplete observational data. We first present a policy evaluation procedure that assesses how well a policy performs under the worst-case environment shift. We then establish a central-limit-theorem-type guarantee for the proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that learns a policy robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of the proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that is missing from standard policy learning algorithms. We conclude the paper with a comprehensive application of our methods to a real-world voting dataset.
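To make the idea of worst-case policy evaluation concrete, the following is a minimal sketch under assumptions that the abstract itself does not state. It assumes the environment shift is constrained to a Kullback-Leibler ball of radius `delta` around the data-generating distribution, and that logged data are reweighted by self-normalized inverse propensity weights. The worst-case mean reward is then computed through the standard dual representation $\sup_{\alpha > 0} \{ -\alpha \log \mathbb{E}_P[e^{-Y/\alpha}] - \alpha\delta \}$. The function names `worst_case_mean` and `robust_policy_value`, the argument layout, and the toy data are hypothetical and chosen only for illustration.

```python
import math


def worst_case_mean(rewards, probs, delta, iters=200):
    """Smallest mean of `rewards` over all Q with KL(Q || P) <= delta,
    where P puts mass probs[i] on rewards[i] (assumed uncertainty set)."""
    support = [(y, p) for y, p in zip(rewards, probs) if p > 0]
    mean = sum(y * p for y, p in support)
    if delta <= 0:
        return mean  # no shift allowed: the ordinary weighted mean
    y_min = min(y for y, _ in support)
    y_max = max(y for y, _ in support)
    # The dual is concave in alpha, and its maximizer cannot exceed
    # (y_max - y_min) / delta, so a bounded ternary search suffices.
    lo, hi = 1e-12, (y_max - y_min) / delta
    if hi <= lo:
        return y_min

    def dual(alpha):
        # -alpha * log E_P[exp(-Y / alpha)] - alpha * delta,
        # shifted by y_min so that every exponent is <= 0.
        s = sum(p * math.exp(-(y - y_min) / alpha) for y, p in support)
        return y_min - alpha * math.log(s) - alpha * delta

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return dual((lo + hi) / 2)


def robust_policy_value(contexts, actions, rewards, propensities, policy, delta):
    """Worst-case value of `policy` from logged data, using self-normalized
    inverse propensity weights (an assumption of this sketch)."""
    weights = [1.0 / e if policy(x) == a else 0.0
               for x, a, e in zip(contexts, actions, propensities)]
    total = sum(weights)
    if total == 0:
        raise ValueError("policy never agrees with the logged actions")
    probs = [w / total for w in weights]
    return worst_case_mean(rewards, probs, delta)


# Toy logged data (hypothetical): four rounds, uniform logging over two actions.
contexts = [0, 1, 2, 3]
actions = [1, 0, 1, 0]
rewards = [1.0, 5.0, 0.0, 7.0]
propensities = [0.5, 0.5, 0.5, 0.5]
always_one = lambda x: 1  # candidate policy: always choose action 1

nominal = robust_policy_value(contexts, actions, rewards, propensities, always_one, 0.0)
robust = robust_policy_value(contexts, actions, rewards, propensities, always_one, 0.1)
print(nominal, robust)
```

Setting `delta = 0` recovers the usual inverse-propensity estimate, and increasing `delta` can only lower the reported value, which is the sense in which the evaluation is pessimistic about environment shift. A robust learning step would then search a policy class for the policy maximizing this worst-case value; that search is omitted here.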