Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard variance-reduction tool in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper settles the question: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in mean squared error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific, but generally sub-optimal, additive baseline. Our results theoretically justify a shift from self-normalisation to optimal baseline corrections in both ranking and recommendation.
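For concreteness, a brief sketch of the estimators in question, in standard OPE notation that we assume here (logged contexts $x_i$, actions $a_i \sim \pi_0(\cdot \mid x_i)$, rewards $r_i$, and importance weights $w_i = \pi(a_i \mid x_i)/\pi_0(a_i \mid x_i)$, so that $\mathbb{E}[w_i] = 1$):
$$
\hat{V}_{\text{SNIPS}}(\pi) = \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i},
\qquad
\hat{V}_{\beta}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \bigl( w_i r_i - \beta\,(w_i - 1) \bigr).
$$
Standard control-variate theory gives the variance-minimising baseline $\beta^\star = \operatorname{Cov}(w r,\, w) / \operatorname{Var}(w)$, while a first-order (delta-method) expansion suggests that SNIPS behaves asymptotically like the fixed choice $\beta = V(\pi)$, which coincides with $\beta^\star$ only in special cases.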