The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning (ML) increasingly recognizes the importance of data curation to the advancement of both applications and the fundamental understanding of machine learning models, as evidenced not least by the creation of the Datasets and Benchmarks track itself. This work analyzes dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses of current dataset development practices across 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021 to 2023, and we summarize key findings and trends. Results indicate a greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide our results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML, as well as to scholars in data curation and in science and technology studies who study practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.
