On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation

Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-$n$ recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
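The inversion claim can be illustrated with a small toy sketch. It assumes the textbook IR form of DCG (binary relevance labels, a 1/log2(rank + 1) discount) and per-user normalisation by the ideal DCG; it is not the unbiased estimator derived in the paper, which the abstract does not spell out. The two users, the two methods, and every relevance label below are invented purely for illustration.

```python
import math

def dcg(relevances):
    """Textbook DCG: sum of rel / log2(rank + 1) over ranks 1..n."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, n_relevant):
    """DCG divided by the ideal DCG for a user with n_relevant relevant items."""
    ideal = dcg([1] * min(n_relevant, len(relevances)))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical data: number of relevant items per user (assumption).
n_relevant = {"u1": 3, "u2": 1}

# Hypothetical top-3 lists, given as binary relevance labels per rank.
method_a = {"u1": [1, 1, 1], "u2": [0, 0, 0]}  # excels on u1, fails on u2
method_b = {"u1": [1, 0, 0], "u2": [1, 0, 0]}  # one hit at rank 1 for each user

def mean_dcg(lists):
    return sum(dcg(rels) for rels in lists.values()) / len(lists)

def mean_ndcg(lists):
    return sum(ndcg(rels, n_relevant[u]) for u, rels in lists.items()) / len(lists)

for name, lists in (("A", method_a), ("B", method_b)):
    print(f"method {name}: DCG={mean_dcg(lists):.4f}  nDCG={mean_ndcg(lists):.4f}")
```

In this toy setting method A has the higher mean DCG while method B has the higher mean nDCG: dividing by a per-user ideal re-weights users, so a method that collects more total gain can still rank lower after normalisation.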

