Recommendation has become a prominent area of research in the field of Information Retrieval (IR). Evaluation is likewise a long-standing research topic in this community. Motivated by a few counter-intuitive observations reported in recent studies, this perspectives paper takes a fresh look at recommender systems from an evaluation standpoint. Rather than examining metrics like recall, hit rate, or NDCG, or perspectives like novelty and diversity, the focus here is on how these metrics are calculated when evaluating a recommender algorithm. Specifically, the commonly used train/test data splits and their consequences are re-examined. We begin by examining common data splitting methods, such as random split or leave-one-out, and discuss why the popularity baseline is poorly defined under such splits. We then explore two implications of neglecting a global timeline during evaluation: data leakage and oversimplification of user preference modeling. Finally, we present new perspectives on recommender systems, including techniques for evaluating algorithm performance that more accurately reflect real-world scenarios, and possible approaches for incorporating decision contexts into user preference modeling.
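To make the data leakage concern concrete, the following is a minimal sketch, not the paper's own procedure. It contrasts a leave-one-out split with a split at a single global cutoff time on a hypothetical toy interaction log. The log, the function names (`leave_one_out`, `global_time_split`, `leaked`, `popularity`), and the cutoff value are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy log of (user, item, timestamp) interactions.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u2", "a", 4), ("u2", "d", 8), ("u2", "e", 9),
    ("u3", "b", 5), ("u3", "d", 6), ("u3", "e", 7),
]


def leave_one_out(log):
    """Hold out each user's last interaction; all other rows form the training set."""
    last = {}
    for row in log:
        user, _, ts = row
        if user not in last or ts > last[user][2]:
            last[user] = row
    test = sorted(last.values(), key=lambda r: r[2])
    held_out = set(test)
    train = [r for r in log if r not in held_out]
    return train, test


def global_time_split(log, cutoff):
    """Train on everything up to the cutoff time, test on everything after it."""
    train = [r for r in log if r[2] <= cutoff]
    test = [r for r in log if r[2] > cutoff]
    return train, test


def leaked(train, test_row):
    """Training interactions that happened after the given test interaction."""
    return [r for r in train if r[2] > test_row[2]]


def popularity(train, before=None):
    """Item counts in the training set, optionally restricted to rows before a time."""
    return Counter(item for _, item, ts in train if before is None or ts < before)


loo_train, loo_test = leave_one_out(interactions)
time_train, time_test = global_time_split(interactions, cutoff=6)

for row in loo_test:
    print("leave-one-out", row, "future training rows:", len(leaked(loo_train, row)))
for row in time_test:
    print("global cutoff", row, "future training rows:", len(leaked(time_train, row)))
```

On this toy log, the leave-one-out training set contains four interactions that occur after u1's held-out interaction at time 3, so a model evaluated on u1 has effectively seen the future. The same effect shows up in a popularity count: item `d` has two training interactions under leave-one-out, yet `popularity(loo_train, before=3)` gives it none, because nobody had interacted with it by the time u1's test interaction happened. With the global cutoff, no training row postdates any test row.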