We study policy evaluation for offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under worst-case confounding over a given uncertainty set. However, existing work often resorts to a coarse relaxation of the uncertainty set for the sake of tractability, which leads to overly conservative estimates of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound on the policy value using convex programming. The generality of our estimator enables various extensions, such as sensitivity analysis with f-divergences, model selection with cross-validation and information criteria, and robust policy learning with the sharp lower bound. Furthermore, thanks to strong duality, our estimation method can be reformulated as an empirical risk minimization problem, which enables us to provide strong theoretical guarantees for the proposed estimator using techniques from M-estimation.
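To make concrete why a coarser uncertainty set yields a more conservative bound, the following toy sketch compares two worst-case lower bounds on an inverse-propensity-weighted policy value. It is not the estimator proposed in the paper. It assumes, purely for illustration, a box uncertainty set on the inverse propensity weights derived from an odds-ratio bound `gamma`, plus a single empirical normalization constraint standing in for the kind of constraint that a relaxation discards. All function names, the data, and the choice of constraint are hypothetical.

```python
def weight_bounds(propensity, gamma):
    """Box bounds on the unknown true inverse propensity.

    Assumption: the odds ratio between the nominal and the true propensity
    lies in [1/gamma, gamma] with gamma >= 1.
    """
    inv_odds = (1.0 - propensity) / propensity
    return 1.0 + inv_odds / gamma, 1.0 + inv_odds * gamma


def box_lower_bound(rewards, target_probs, propensities, gamma):
    """Worst case over the box alone: each weight is chosen independently."""
    total = 0.0
    for r, p, e in zip(rewards, target_probs, propensities):
        lo, hi = weight_bounds(e, gamma)
        c = p * r
        # A nonnegative term is minimized by the smallest weight, and vice versa.
        total += c * (lo if c >= 0 else hi)
    return total / len(rewards)


def constrained_lower_bound(rewards, target_probs, propensities, gamma):
    """Worst case over the box plus the constraint (1/n) * sum(w_i * p_i) = 1.

    This is a linear program with one equality constraint, so a greedy
    (continuous knapsack) pass solves it exactly: start every weight at its
    lower bound, then spend the remaining mass on the smallest rewards first.
    """
    n = len(rewards)
    base_value = 0.0
    base_mass = 0.0
    items = []  # (reward, extra mass this sample can absorb)
    for r, p, e in zip(rewards, target_probs, propensities):
        lo, hi = weight_bounds(e, gamma)
        base_value += p * lo * r
        base_mass += p * lo
        items.append((r, p * (hi - lo)))
    remaining = n - base_mass
    if remaining < 0 or remaining > sum(cap for _, cap in items):
        raise ValueError("normalization constraint is infeasible for this gamma")
    value = base_value
    for r, cap in sorted(items):  # ascending reward
        take = min(cap, remaining)
        value += r * take
        remaining -= take
    return value / n


# Hypothetical logged data: four samples, uniform logging over two actions,
# and a deterministic target policy that agrees with the log on samples 0 and 2.
rewards = [1.0, 0.0, 2.0, 0.5]
target_probs = [1.0, 0.0, 1.0, 0.0]
propensities = [0.5, 0.5, 0.5, 0.5]

loose = box_lower_bound(rewards, target_probs, propensities, gamma=2.0)
tight = constrained_lower_bound(rewards, target_probs, propensities, gamma=2.0)
print(loose, tight)  # 1.125 1.375
```

On this toy data the nominal inverse-propensity estimate is 1.5. The box-only bound drops to 1.125, while adding the one constraint raises the lower bound to 1.375, which shows how every constraint that a relaxation discards can make the bound looser than necessary.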