The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to obtain unbiased estimates of online metrics from offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically what complicates practical applications. An important insight is that the difference between two policy values can often be estimated with significantly reduced variance when the value estimates of those policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating the improvement of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
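The variance argument can be illustrated with a small exact calculation. The sketch below is not the paper's implementation: it assumes a hypothetical context-free setting with three actions and binary rewards, and all policies, reward probabilities, and function names are made up for illustration. It relies on the identity $\mathrm{Var}(A - B) = \mathrm{Var}(A) + \mathrm{Var}(B) - 2\,\mathrm{Cov}(A, B)$: two IPS estimates computed on independent logs have zero covariance, whereas a pairwise estimate on shared logs benefits from the covariance term. The baseline `beta_star` is the variance minimiser derived for this sketch only, under the stated assumptions.

```python
from itertools import product

# Hypothetical 3-action, context-free setting; all numbers are illustrative.
LOGGING = [0.5, 0.3, 0.2]      # pi_0: stochastic logging policy
PRODUCTION = [0.6, 0.3, 0.1]   # pi_p: production policy
TARGET = [0.4, 0.3, 0.3]       # pi_t: learnt policy
REWARD_PROB = [0.1, 0.2, 0.3]  # P(reward = 1 | action)
ACTIONS = range(len(LOGGING))


def outcomes():
    """Yield (probability, action, reward) for every possible logged outcome."""
    for a, r in product(ACTIONS, (0, 1)):
        p_r = REWARD_PROB[a] if r == 1 else 1.0 - REWARD_PROB[a]
        yield LOGGING[a] * p_r, a, r


def mean_and_variance(term):
    """Exact per-sample mean and variance of term(action, reward)."""
    mean = sum(p * term(a, r) for p, a, r in outcomes())
    second_moment = sum(p * term(a, r) ** 2 for p, a, r in outcomes())
    return mean, second_moment - mean ** 2


def ips(policy):
    """Per-sample IPS term for a single policy."""
    return lambda a, r: policy[a] / LOGGING[a] * r


def delta_ips(beta=0.0):
    """Per-sample pairwise term with an additive baseline beta."""
    return lambda a, r: (TARGET[a] - PRODUCTION[a]) / LOGGING[a] * (r - beta)


v_t, var_t = mean_and_variance(ips(TARGET))
v_p, var_p = mean_and_variance(ips(PRODUCTION))
delta, var_delta = mean_and_variance(delta_ips())

# Baseline minimising the variance of this sketch (an assumption of the example):
# beta = E[dw^2 * r] / E[dw^2], with dw = (pi_t - pi_p) / pi_0.
dw2 = [((TARGET[a] - PRODUCTION[a]) / LOGGING[a]) ** 2 for a in ACTIONS]
beta_star = (sum(LOGGING[a] * dw2[a] * REWARD_PROB[a] for a in ACTIONS)
             / sum(LOGGING[a] * dw2[a] for a in ACTIONS))
delta_cv, var_delta_cv = mean_and_variance(delta_ips(beta_star))

print(f"true difference V(pi_t) - V(pi_p): {v_t - v_p:.4f}")
print(f"variance, separate IPS on independent logs: {var_t + var_p:.4f}")
print(f"variance, pairwise estimate on shared logs: {var_delta:.4f}")
print(f"variance, pairwise estimate with baseline:  {var_delta_cv:.4f}")
```

With these illustrative numbers, the per-sample variance of the estimated difference drops from roughly 0.32 (independent logs) to roughly 0.07 (shared logs), and to roughly 0.05 once the baseline is subtracted. The baseline leaves the estimate unbiased because the weight difference $(\pi_t - \pi_p)/\pi_0$ has zero mean under the logging policy.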