Our Model Achieves Excellent Performance on MovieLens: What Does it Mean?

A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., likes, purchases, or rates) an item, and the context in which a particular interaction happened. In this study, we conduct a meticulous analysis of the MovieLens dataset and explain the potential impact of using the dataset for evaluating recommendation algorithms. Our analysis yields three main findings. First, user interactions differ significantly across the stages of a user's engagement with the MovieLens platform. The early interactions largely define the user portrait, which in turn affects the subsequent interactions. Second, user interactions are highly affected by the candidate movies that are recommended by the platform's internal recommendation algorithm(s). Third, changing the order of user interactions makes it more difficult for sequential algorithms to capture the progressive interaction process. We further discuss the discrepancy between the interaction generation mechanism employed by the MovieLens system and that of typical real-world recommendation scenarios. In summary, the MovieLens platform demonstrates an efficient and effective way of collecting user preferences to address cold-starts. However, models that achieve excellent recommendation accuracy on the MovieLens dataset may not demonstrate superior performance in practice, because of at least two kinds of differences: (i) differences in the contexts of user-item interaction generation, and (ii) differences in user knowledge about the item collections. While results on MovieLens can be useful as a reference, they should not be relied upon as the primary justification for the effectiveness of a recommender system model.
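To make the third finding concrete, the following is a minimal sketch of what "changing the order of user interactions" could look like when preparing input for a sequential recommender. It is not the procedure used in this study. The function name `shuffle_user_sequences`, the `(user_id, item_id, timestamp)` tuple layout, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def shuffle_user_sequences(interactions, seed=0):
    """Build per-user item sequences in chronological and shuffled order.

    `interactions` is an iterable of (user_id, item_id, timestamp) tuples.
    This layout is an assumption for illustration, not a fixed dataset format.
    Returns two dicts mapping user_id -> list of item_ids:
    the chronological sequences and a randomly permuted copy.
    """
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    by_user = defaultdict(list)
    for user_id, item_id, timestamp in interactions:
        by_user[user_id].append((timestamp, item_id))

    chronological, shuffled = {}, {}
    for user_id, events in by_user.items():
        events.sort()  # order by timestamp (ties broken by item_id)
        items = [item_id for _, item_id in events]
        chronological[user_id] = items
        permuted = list(items)
        rng.shuffle(permuted)  # same items, order information destroyed
        shuffled[user_id] = permuted
    return chronological, shuffled
```

Training the same sequential model once on the chronological sequences and once on the shuffled ones would give a rough indication of how much the model relies on interaction order. Each shuffled sequence contains exactly the same items as its chronological counterpart, so only the order differs.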
