The off-policy learning paradigm allows for recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all of the existing control variates. Consequently, our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it. This optimal estimator brings significantly improved performance in both evaluation and learning, and minimizes data requirements. Empirical observations corroborate our theoretical findings.
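To make the estimator families mentioned above concrete, the following is a minimal numpy sketch of the standard inverse-propensity-scoring (IPS) estimator, an additive baseline-corrected variant, and the multiplicative self-normalised variant (SNIPS). This is an illustration under assumed conventions, not the paper's implementation: the function and variable names (`ips`, `ips_with_baseline`, `snips`, `target_probs`, `logging_probs`) are hypothetical, the demo data is synthetic, and the closed-form variance-optimal baseline derived in the paper is not reproduced here.

```python
import numpy as np

def ips(rewards, target_probs, logging_probs):
    """Vanilla inverse-propensity-scoring (IPS) estimate of policy value."""
    w = target_probs / logging_probs            # importance weights
    return np.mean(w * rewards)

def ips_with_baseline(rewards, target_probs, logging_probs, beta):
    """Additive control variate: shift rewards by a baseline `beta`, add it back.
    For any fixed beta the estimate stays unbiased, because the importance
    weights have expectation 1 under the logging policy."""
    w = target_probs / logging_probs
    return np.mean(w * (rewards - beta)) + beta

def snips(rewards, target_probs, logging_probs):
    """Multiplicative control variate: self-normalised IPS (SNIPS)."""
    w = target_probs / logging_probs
    return np.sum(w * rewards) / np.sum(w)

if __name__ == "__main__":
    # Tiny synthetic log, only to show how the estimators are called.
    rng = np.random.default_rng(0)
    n = 10_000
    rewards = rng.binomial(1, 0.3, size=n).astype(float)
    logging_probs = rng.uniform(0.1, 0.9, size=n)   # propensities of logged actions
    target_probs = np.clip(logging_probs + rng.normal(0.0, 0.1, size=n), 0.05, 0.95)

    print("IPS:               ", ips(rewards, target_probs, logging_probs))
    print("IPS + baseline 0.3:", ips_with_baseline(rewards, target_probs, logging_probs, 0.3))
    print("SNIPS:             ", snips(rewards, target_probs, logging_probs))
```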