We study off-policy evaluation (OPE) of contextual bandit policies in large discrete action spaces, where conventional importance-weighting approaches suffer from excessive variance. To circumvent this variance issue, we propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect. OffCEM applies importance weighting only to action clusters and addresses the residual causal effect through model-based reward estimation. We show that the proposed estimator is unbiased under a new condition, called local correctness, which only requires that the residual-effect model preserves the relative expected reward differences of the actions within each cluster. To best leverage the CEM and local correctness, we also propose a new two-step procedure for performing model-based estimation that minimizes bias in the first step and variance in the second step. We find that the resulting OffCEM estimator substantially improves on a range of conventional estimators in terms of bias and variance. Experiments demonstrate that OffCEM provides substantial improvements in OPE, especially when the number of actions is large.
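As a rough illustration of the idea described above, the following Python sketch considers a single context, a known action-to-cluster mapping, and an estimator that adds a cluster-level importance-weighted residual (reward minus model prediction) to the model's expected value under the target policy. This functional form, the toy numbers, and all names (`offcem_estimate`, `expected_estimate`, `f_hat`, `offsets`) are assumptions made for illustration; they are not taken from the text above. The model `f_hat` is built to be locally correct by shifting the true expected rewards by an arbitrary per-cluster offset, so within-cluster differences are preserved.

```python
import random


def cluster_probs(policy, clusters, n_clusters):
    """Marginalize an action distribution into a distribution over clusters."""
    probs = [0.0] * n_clusters
    for action, p in enumerate(policy):
        probs[clusters[action]] += p
    return probs


def offcem_estimate(actions, rewards, pi, pi0, clusters, f_hat, n_clusters):
    """Assumed form: cluster weight * (reward - model) + model value under pi."""
    pi_c = cluster_probs(pi, clusters, n_clusters)
    pi0_c = cluster_probs(pi0, clusters, n_clusters)
    model_value = sum(p * f for p, f in zip(pi, f_hat))
    total = 0.0
    for a, r in zip(actions, rewards):
        w = pi_c[clusters[a]] / pi0_c[clusters[a]]  # cluster-level weight
        total += w * (r - f_hat[a]) + model_value
    return total / len(actions)


def expected_estimate(q, pi, pi0, clusters, f_hat, n_clusters):
    """Exact expectation of the estimator when actions are drawn from pi0."""
    pi_c = cluster_probs(pi, clusters, n_clusters)
    pi0_c = cluster_probs(pi0, clusters, n_clusters)
    model_value = sum(p * f for p, f in zip(pi, f_hat))
    weighted = sum(
        pi0[a] * pi_c[clusters[a]] / pi0_c[clusters[a]] * (q[a] - f_hat[a])
        for a in range(len(q))
    )
    return weighted + model_value


# Hypothetical toy problem: six actions grouped into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
q = [0.1, 0.4, 0.3, 0.7, 0.5, 0.9]       # true expected rewards
pi0 = [0.3, 0.2, 0.1, 0.1, 0.2, 0.1]     # logging policy
pi = [0.05, 0.05, 0.1, 0.2, 0.1, 0.5]    # target policy
offsets = [0.25, -0.4]                   # arbitrary per-cluster model error

# Locally correct model: wrong in level, right in within-cluster differences.
f_hat = [q[a] - offsets[clusters[a]] for a in range(len(q))]
true_value = sum(p * r for p, r in zip(pi, q))

# Simulated log with noise-free rewards, for simplicity.
rng = random.Random(0)
logged_actions = rng.choices(range(len(q)), weights=pi0, k=20000)
logged_rewards = [q[a] for a in logged_actions]
estimate = offcem_estimate(
    logged_actions, logged_rewards, pi, pi0, clusters, f_hat, 2
)
```

In this toy setting, `expected_estimate(q, pi, pi0, clusters, f_hat, 2)` agrees with `true_value` up to floating-point error even though `f_hat` is wrong in absolute terms. Replacing `f_hat` with a model that distorts within-cluster differences (for example, all zeros) generally breaks that equality.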