The off-policy learning paradigm allows recommender systems and general ranking applications to be framed as decision-making problems, in which we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. Unbiasedness often comes at the cost of high variance, and several widely used methods exist to reduce it. These methods typically rely on control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches in a single framework built on their equivalence in learning scenarios. The foundation of the framework is the derivation of an equivalent baseline correction for each of the existing control variates. This in turn lets us characterize the variance-optimal unbiased estimator and provide a closed-form solution for it. The optimal estimator significantly improves performance in both evaluation and learning, and minimizes data requirements. Empirical observations corroborate our theoretical findings.
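To make the estimators named above concrete, the following is a minimal sketch and not the paper's implementation. It assumes a context-free bandit with three actions, a uniform logging policy, and Bernoulli rewards, all of which are hypothetical choices made for illustration, as are the function names. The baseline shown is the textbook control-variate coefficient Cov(wr, w) / Var(w), which may differ from the closed-form solution the paper derives. The sketch also illustrates one simple algebraic link between the two families: plugging the self-normalised estimate in as the baseline returns that same estimate.

```python
import numpy as np


def simulate_logs(n, rng):
    """Hypothetical synthetic log: 3 actions, uniform logging policy."""
    p_reward = np.array([0.1, 0.5, 0.8])     # assumed Bernoulli reward rates
    pi_log = np.array([1 / 3, 1 / 3, 1 / 3])  # assumed logging policy
    pi_target = np.array([0.1, 0.2, 0.7])     # assumed target policy
    a = rng.choice(3, size=n, p=pi_log)
    r = rng.binomial(1, p_reward[a]).astype(float)
    w = pi_target[a] / pi_log[a]              # importance weights, E[w] = 1
    return w, r


def ips(w, r):
    """Plain inverse propensity scoring estimate."""
    return float(np.mean(w * r))


def baseline_corrected_ips(w, r, beta):
    """Additive control variate: subtracts beta * (w - 1), which has zero mean."""
    return float(np.mean(w * (r - beta)) + beta)


def snips(w, r):
    """Multiplicative control variate: self-normalised IPS."""
    return float(np.sum(w * r) / np.sum(w))


def control_variate_beta(w, r):
    """Textbook variance-minimising coefficient Cov(w r, w) / Var(w).

    Estimating beta on the same sample introduces a small bias;
    a held-out estimate would avoid that.
    """
    wr = w * r
    return float(np.cov(wr, w, ddof=1)[0, 1] / np.var(w, ddof=1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, r = simulate_logs(10_000, rng)
    beta = control_variate_beta(w, r)
    print("IPS:              ", ips(w, r))
    print("SNIPS:            ", snips(w, r))
    print("Baseline-corrected:", baseline_corrected_ips(w, r, beta))
    # Using the SNIPS value as the baseline reproduces SNIPS itself.
    print("Baseline = SNIPS: ", baseline_corrected_ips(w, r, snips(w, r)))
```

Setting `beta = 0.0` recovers plain IPS, so the baseline-corrected form can be read as a one-parameter family that contains IPS and passes through the SNIPS estimate.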