Driven by strong interest from researchers and companies, recommendation systems have become a cornerstone of machine learning applications. However, concerns about reproducibility have recently emerged, making it difficult to identify suitable pipelines. Several frameworks have been proposed to improve reproducibility, covering the entire process from data reading to performance evaluation. Despite this effort, these solutions often overlook the role of data management, do not promote interoperability, and neglect data analysis despite its well-known impact on recommender performance. To address these gaps, we propose DataRec, which facilitates the use and manipulation of recommendation datasets. DataRec supports reading and writing in various formats, offers filtering and splitting techniques, and enables analysis of data distributions using well-known metrics. By exporting data in formats compatible with several recommendation frameworks, it encourages a unified approach to data manipulation.
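To make the data operations named in the abstract concrete, the following is a minimal sketch in plain Python of three of them: filtering (an iterative k-core), splitting (a random hold-out), and distribution analysis (density and the Gini coefficient of item popularity). It is not DataRec's API. The function names, the tuple layout `(user_id, item_id, rating)`, and the toy interactions are all assumptions introduced only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy interactions: (user_id, item_id, rating). Not a real dataset.
INTERACTIONS = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u1", "i3", 4.0),
    ("u2", "i1", 4.0), ("u2", "i2", 2.0),
    ("u3", "i1", 1.0), ("u3", "i2", 5.0), ("u3", "i3", 3.0),
    ("u4", "i4", 2.0),
]


def k_core(rows, k):
    """Repeatedly drop users and items with fewer than k interactions."""
    while True:
        user_counts = Counter(u for u, _, _ in rows)
        item_counts = Counter(i for _, i, _ in rows)
        kept = [r for r in rows
                if user_counts[r[0]] >= k and item_counts[r[1]] >= k]
        if len(kept) == len(rows):  # nothing removed: the core is stable
            return kept
        rows = kept


def random_holdout(rows, test_ratio=0.2, seed=42):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def density(rows):
    """Observed interactions divided by all possible user-item pairs."""
    users = {u for u, _, _ in rows}
    items = {i for _, i, _ in rows}
    return len(rows) / (len(users) * len(items))


def gini(counts):
    """Gini coefficient of non-negative counts (0 means perfectly even)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


core = k_core(INTERACTIONS, k=2)
train, test = random_holdout(core, test_ratio=0.25)
item_popularity = Counter(i for _, i, _ in core)
print(len(core), len(train), len(test))
print(density(core), gini(item_popularity.values()))
```

Fixing the seed in the split is the design choice that matters most for the reproducibility concern raised above: two runs over the same filtered data yield the same train and test sets.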