Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-$n$ recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
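The inconsistency claim above can be illustrated with a small numeric sketch. It is not the derivation from this work: it assumes the textbook DCG with binary relevance and a logarithmic discount, per-user normalisation by the ideal DCG, and entirely hypothetical toy data for two users and two competing methods (the names `METHOD_A`, `METHOD_B`, and `NUM_RELEVANT` are made up for illustration).

```python
import math


def dcg(relevances):
    """Textbook DCG: sum of rel_k / log2(k + 1) over 1-indexed ranks k."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, num_relevant):
    """DCG divided by the ideal DCG of a user with `num_relevant` relevant items."""
    # The ideal list places as many relevant items as fit at the top ranks.
    ideal = [1] * min(num_relevant, len(relevances))
    ideal_dcg = dcg(ideal)
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical toy data (an assumption for illustration only):
# user u1 has three relevant items in total, user u2 has one.
NUM_RELEVANT = {"u1": 3, "u2": 1}

# Binary relevance of each method's top-3 list, per user.
METHOD_A = {"u1": [1, 1, 1], "u2": [0, 0, 0]}
METHOD_B = {"u1": [1, 0, 0], "u2": [1, 0, 0]}


def evaluate(method):
    """Return (mean DCG, mean nDCG) over all users for one method."""
    mean_dcg = mean(dcg(rels) for rels in method.values())
    mean_ndcg = mean(ndcg(rels, NUM_RELEVANT[user]) for user, rels in method.items())
    return mean_dcg, mean_ndcg


dcg_a, ndcg_a = evaluate(METHOD_A)
dcg_b, ndcg_b = evaluate(METHOD_B)

print(f"Method A: DCG={dcg_a:.4f}, nDCG={ndcg_a:.4f}")
print(f"Method B: DCG={dcg_b:.4f}, nDCG={ndcg_b:.4f}")
print("Preferred by DCG: ", "A" if dcg_a > dcg_b else "B")
print("Preferred by nDCG:", "A" if ndcg_a > ndcg_b else "B")
```

In this toy setting, method A collects more discounted gain in total (mean DCG of roughly 1.065 against 1.0), yet method B wins on mean nDCG (roughly 0.735 against 0.5), because normalisation reweights each user by the inverse of their ideal DCG. Whether DCG itself tracks online reward depends on the assumptions discussed in the text, which this sketch does not model.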