We study policy evaluation for offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to a coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimates of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound on the policy value using convex programming. The generality of our estimator enables various extensions, such as sensitivity analysis with f-divergences, model selection with cross-validation and information criteria, and robust policy learning with the sharp lower bound. Furthermore, thanks to strong duality, our estimation method can be reformulated as an empirical risk minimization problem, which allows us to provide strong theoretical guarantees for the proposed estimator using techniques from M-estimation.
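To make the sharp-lower-bound idea concrete, the following is a toy sketch (not the paper's estimator): for a self-normalized inverse-propensity-weighted value estimate with weights confined to a simple box-style uncertainty set parameterized by a hypothetical sensitivity level `gamma`, the worst-case value is a linear-fractional program whose minimum is attained at a threshold vertex, so scanning the splits of the sorted outcomes computes the sharp bound exactly. All names and the specific uncertainty set here are illustrative assumptions.

```python
import numpy as np

def sharp_lower_bound(y, w0, gamma):
    """Sharp lower bound on the self-normalized weighted value
    V(w) = sum(w * y) / sum(w) over the box w0/gamma <= w <= w0*gamma
    (an illustrative, simplified sensitivity set; gamma >= 1).

    Minimizing a linear-fractional objective over a box is a convex
    (linear-fractional) program; at the optimum, w_i sits at its upper
    bound when y_i is below the optimal value and at its lower bound
    otherwise, so the minimizer is one of the n + 1 threshold splits
    of the outcomes sorted in increasing order.
    """
    y = np.asarray(y, dtype=float)
    w0 = np.asarray(w0, dtype=float)
    order = np.argsort(y)
    y, w0 = y[order], w0[order]
    lo, hi = w0 / gamma, w0 * gamma
    best = np.inf
    for k in range(len(y) + 1):
        # Put the largest feasible weight on the k smallest outcomes.
        w = np.concatenate([hi[:k], lo[k:]])
        best = min(best, float(w @ y / w.sum()))
    return best
```

With `gamma = 1` the box collapses and the plain self-normalized estimate is recovered; as `gamma` grows, the bound decreases, reflecting a larger confounding budget.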