To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Moreover, data-sharing practices in many scientific domains have evolved in recent years to support reproducibility. Accordingly, academic institutions' adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed outlets such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and of regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers from different domains, assessing their completeness, their coverage of the requested dimensions, and trends in recent years. We focus on the most and least documented dimensions and compare the results with those of an ML-focused venue (the NeurIPS Datasets and Benchmarks track) that publishes papers describing datasets. Based on these findings, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase the readiness of their data for transparent and fairer use in ML technologies.