In the rapidly evolving domain of Recommender Systems (RecSys), new algorithms frequently claim state-of-the-art performance based on evaluations over a limited, arbitrarily selected set of datasets. However, such evaluations may fail to reflect an algorithm's effectiveness holistically, because dataset characteristics strongly influence algorithm performance. Addressing this deficiency, this paper introduces a novel benchmarking methodology that enables a fair and robust comparison of RecSys algorithms, thereby advancing evaluation practices. Using a diverse collection of $30$ open datasets, including two introduced in this work, and evaluating $11$ collaborative filtering algorithms across $9$ metrics, we critically examine the influence of dataset characteristics on algorithm performance. We further investigate the feasibility of aggregating outcomes from multiple datasets into a unified ranking. Through rigorous experimental analysis, we validate the reliability of our methodology under dataset variability and derive a benchmarking strategy that balances evaluation quality against computational demands. This methodology provides a fair and effective means of evaluating RecSys algorithms and offers valuable guidance for future research.