Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly recognizing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models, as evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess strengths and weaknesses in the development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021 to 2023, and we summarize key findings and trends. Results indicate a greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide our results in the form of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest both for improving data curation practices broadly in the field of ML and to scholars of data curation and of science and technology studies who examine practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.