Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behavior policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.
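To make the variance claim concrete, here is a minimal sketch comparing IPW with an MR-style estimator on a toy bandit. It assumes the standard forms $\hat\theta_{\mathrm{IPW}} = \frac{1}{n}\sum_i \rho(a_i, x_i)\, y_i$ with policy ratio $\rho = \pi^*/\pi_b$, and $\hat\theta_{\mathrm{MR}} = \frac{1}{n}\sum_i w(y_i)\, y_i$ with marginal ratio $w(y) = p_{\pi^*}(y)/p_{\pi_b}(y)$. The notation, the simulated environment, and every constant below are illustrative assumptions, not taken from the paper's experiments. The sketch also plugs in the true $w(y)$, which is available in closed form here; in practice $w$ has to be estimated from data, and that estimation error is not shown.

```python
import numpy as np

# Hypothetical toy bandit: binary context x, 10 actions, binary reward y.
N_ACTIONS = 10
P_MATCH_TARGET = 0.82   # target policy's probability of the "matching" action a == x
P_REWARD_MATCH = 0.8    # P(Y = 1 | a == x)
P_REWARD_OTHER = 0.3    # P(Y = 1 | a != x)


def simulate(n, rng):
    """Log n rounds under a uniform behavior policy; return rewards and policy ratios."""
    x = rng.integers(0, 2, size=n)
    a = rng.integers(0, N_ACTIONS, size=n)  # behavior policy: uniform over actions
    match = a == x
    y = rng.binomial(1, np.where(match, P_REWARD_MATCH, P_REWARD_OTHER))
    pi_target = np.where(match, P_MATCH_TARGET, (1 - P_MATCH_TARGET) / (N_ACTIONS - 1))
    rho = pi_target * N_ACTIONS  # pi_target / pi_behavior, with pi_behavior = 1 / N_ACTIONS
    return y, rho


def true_marginal_ratio():
    """Closed-form w(y) = p_target(y) / p_behavior(y) for y in {0, 1}, plus the true value."""
    p_match_behavior = 1.0 / N_ACTIONS
    p1_behavior = p_match_behavior * P_REWARD_MATCH + (1 - p_match_behavior) * P_REWARD_OTHER
    p1_target = P_MATCH_TARGET * P_REWARD_MATCH + (1 - P_MATCH_TARGET) * P_REWARD_OTHER
    w = np.array([(1 - p1_target) / (1 - p1_behavior), p1_target / p1_behavior])
    return w, p1_target  # E[Y] under the target policy equals P(Y = 1)


def ipw_estimate(y, rho):
    return np.mean(rho * y)


def mr_estimate(y, w):
    return np.mean(w[y] * y)  # w is indexed by the observed outcome


rng = np.random.default_rng(0)
w, true_value = true_marginal_ratio()

ipw_est, mr_est = [], []
for _ in range(500):  # repeat the experiment to see the spread of each estimator
    y, rho = simulate(1000, rng)
    ipw_est.append(ipw_estimate(y, rho))
    mr_est.append(mr_estimate(y, w))
ipw_est, mr_est = np.array(ipw_est), np.array(mr_est)

print(f"true value  : {true_value:.3f}")
print(f"IPW mean/var: {ipw_est.mean():.3f} / {ipw_est.var():.5f}")
print(f"MR  mean/var: {mr_est.mean():.3f} / {mr_est.var():.5f}")
```

With the true weights, both estimators are unbiased in this setting, and the MR estimates should spread less across repetitions. This follows from the law of total variance: since $\mathbb{E}_{\pi_b}[\rho(A, X)\, Y \mid Y] = w(Y)\, Y$, the MR summand is a conditional expectation of the IPW summand and so cannot have larger variance.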