To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and research in the ML community have highlighted the need to document the data used to train ML models. In parallel, data-sharing practices in many scientific domains have evolved in recent years to support reproducibility. The adoption of these practices by academic institutions has encouraged researchers to publish their data and its technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how well this scientific data documentation meets the needs of the ML community and regulatory bodies for the use of such data in ML technologies. We examine a sample of 4041 data papers from different domains, assessing their completeness and coverage of the requested dimensions and their trends in recent years, with special emphasis on the most and least documented dimensions. Based on the results, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for transparent and fairer use in ML technologies.