We study policy evaluation for offline contextual bandits in the presence of unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under worst-case confounding over a given uncertainty set. However, existing work often resorts to a coarse relaxation of the uncertainty set for the sake of tractability, which leads to overly conservative estimates of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound on the policy value using convex programming. The generality of our estimator enables various extensions, such as sensitivity analysis with f-divergences, model selection with cross-validation and information criteria, and robust policy learning with the sharp lower bound. Furthermore, thanks to strong duality, our estimation method can be reformulated as an empirical risk minimization problem, which enables us to provide strong theoretical guarantees for the proposed estimator using techniques from M-estimation.
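To make the notion of a worst-case policy value over an uncertainty set concrete, the following is a minimal sketch of a box-constrained bound, that is, the kind of relaxed uncertainty set the abstract describes as conservative. It is not the sharp estimator proposed in the paper. The function name `worst_case_value` and the inputs `rewards`, `w_lo`, and `w_hi` are hypothetical: we assume each sample's unknown importance weight lies in a known interval and minimize the self-normalized weighted mean reward over all such weights. Because the objective is linear-fractional, the minimizer places the upper bound on low-reward samples and the lower bound on high-reward samples, so scanning thresholds in sorted order is enough.

```python
from typing import Sequence


def worst_case_value(
    rewards: Sequence[float],
    w_lo: Sequence[float],
    w_hi: Sequence[float],
) -> float:
    """Minimize sum(w * r) / sum(w) subject to w_lo[i] <= w[i] <= w_hi[i].

    Illustrative only: the interval bounds on the weights are assumed to be
    given (hypothetical inputs), and this is a relaxed, box-constrained bound.
    """
    n = len(rewards)
    if n == 0 or len(w_lo) != n or len(w_hi) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(lo <= 0 or hi < lo for lo, hi in zip(w_lo, w_hi)):
        raise ValueError("need 0 < w_lo[i] <= w_hi[i] for every sample")

    # Start with every weight at its lower bound.
    num = sum(lo * r for lo, r in zip(w_lo, rewards))
    den = sum(w_lo)
    best = num / den

    # Raise weights to their upper bound one sample at a time, from the
    # smallest reward to the largest, and keep the smallest ratio seen.
    for i in sorted(range(n), key=lambda j: rewards[j]):
        delta = w_hi[i] - w_lo[i]
        num += delta * rewards[i]
        den += delta
        best = min(best, num / den)
    return best
```

For example, with rewards `[0.0, 1.0]` and every weight allowed to range over `[1, 2]`, the minimum is attained by weighting the zero-reward sample by 2 and the other by 1, which gives 1/3 rather than the nominal mean of 1/2.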