When Newer is Not Better: Does Deep Learning Really Benefit Recommendation From Implicit Feedback?

In recent years, neural models have been repeatedly touted to exhibit state-of-the-art performance in recommendation. Nevertheless, multiple recent studies have revealed that the reported state-of-the-art results of many neural recommendation models cannot be reliably replicated. A primary reason is that existing evaluations are performed under various inconsistent protocols. Correspondingly, these replicability issues make it difficult to understand how much benefit we can actually gain from these neural models. It then becomes clear that a fair and comprehensive performance comparison between traditional and neural models is needed. Motivated by these issues, we perform a large-scale, systematic study to compare recent neural recommendation models against traditional ones in top-n recommendation from implicit data. We propose a set of evaluation strategies for measuring memorization performance, generalization performance, and subgroup-specific performance of recommendation models. We conduct extensive experiments with 13 popular recommendation models (including two neural models and 11 traditional ones as baselines) on nine commonly used datasets. Our experiments demonstrate that even with extensive hyper-parameter searches, neural models do not dominate traditional models in all aspects, e.g., they fare worse in terms of average HitRate. We further find that there are areas where neural models seem to outperform non-neural models, for example, in recommendation diversity and robustness between different subgroups of users and items. Our work illuminates the relative advantages and disadvantages of neural models in recommendation and is therefore an important step towards building better recommender systems.
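To make the HitRate comparison concrete, the following is a minimal sketch of HitRate@n. It assumes a leave-one-out setup in which each user has exactly one held-out item; the abstract does not state the protocol, so this setup, the function name `hit_rate_at_n`, and the dictionary-based inputs are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, Hashable, Sequence


def hit_rate_at_n(
    ranked: Dict[Hashable, Sequence[Hashable]],
    held_out: Dict[Hashable, Hashable],
    n: int = 10,
) -> float:
    """Fraction of users whose held-out item appears in their top-n list.

    Assumption (not from the paper): one held-out item per user.
    `ranked[user]` is that user's recommendation list, best item first.
    Users with no recommendation list count as misses.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not held_out:
        return 0.0

    hits = 0
    for user, target in held_out.items():
        top_n = list(ranked.get(user, []))[:n]  # truncate to the top-n items
        if target in top_n:
            hits += 1
    return hits / len(held_out)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    ranked = {
        "u1": ["a", "b", "c"],
        "u2": ["d", "e", "f"],
        "u3": ["g", "h", "i"],
    }
    held_out = {"u1": "b", "u2": "f", "u3": "z"}
    print(hit_rate_at_n(ranked, held_out, n=2))
    print(hit_rate_at_n(ranked, held_out, n=3))
```

Averaging this quantity over users is one common reading of "average HitRate"; other protocols, such as several held-out items per user, would need a different definition of a hit.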
