Countless new machine learning models are published every year and are reported to significantly advance the state-of-the-art in top-n recommendation. However, earlier reproducibility studies indicate that progress in this area may be quite limited, due to widespread methodological issues, e.g., comparisons with untuned baseline models, creating an illusion of progress. In this work, we examine whether these problems persist in today's research by attempting to reproduce nine SIGIR 2023 and 2024 recommendation algorithms based on Denoising Diffusion Probabilistic Models, a recent but rapidly expanding research area. Only 25% of reported results are fully reproducible and, since the original papers relied on weak baselines, they do not establish the superiority of diffusion models over state-of-the-art methods. In our controlled evaluations, well-tuned simpler baselines consistently exceed the diffusion-based models' effectiveness reported in the original papers. Furthermore, we identify key mismatches between the characteristics of diffusion models and those of the traditional top-n recommendation task, raising doubts about their suitability for recommendation. Moreover, in the analyzed papers, the generative capabilities of these models are constrained to a minimum. Overall, our results call for greater scientific rigor and a disruptive change in the research and publication culture in this area.