In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the training dataset grows and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, covering 11 large public datasets, each with at least 7 million interactions, and 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes ranging from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group. This revealed a clear positive trend: around 75% of the points at the largest completed sample size were also their group's best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range of slopes remained entirely non-negative, with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, adding more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical datasets and in the algorithmic outlier RecBole BPR under our setup.
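The two analysis steps named above, within-group min-max normalization and a late-stage slope, can be illustrated with a minimal sketch. It is not the paper's implementation. The function names, the use of log10 sample size as the x-axis, the least-squares slope, and the `tail_fraction` parameter are all assumptions made for illustration, and the example values are invented.

```python
import math


def min_max_normalize(values):
    """Scale one result group's values to [0, 1]; a constant group maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def late_stage_slope(sizes, scores, tail_fraction=0.3):
    """Least-squares slope over the last `tail_fraction` of a group's points.

    Assumption: x is min-max-normalized log10(sample size) and y is the
    min-max-normalized score, so a slope near 1.0 means the tail rises
    about as steeply as the group's overall diagonal.
    """
    xs = min_max_normalize([math.log10(s) for s in sizes])
    ys = min_max_normalize(scores)
    # Keep at least two points so that a slope is defined.
    k = max(2, math.ceil(len(xs) * tail_fraction))
    xs, ys = xs[-k:], ys[-k:]
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Invented sample sizes and NDCG@10 values for one hypothetical group.
    sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]
    ndcg = [0.05, 0.08, 0.10, 0.13]
    print(min_max_normalize(ndcg))
    print(late_stage_slope(sizes, ndcg))
```

Under these assumptions, a non-negative slope means the score is still flat or rising across the largest sample sizes in that group, which is the behavior the abstract summarizes.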