On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms

Artificial Intelligence (AI) has made its way into various scientific fields, providing remarkable improvements over existing algorithms for a wide variety of tasks. In recent years, there have been serious concerns over the trustworthiness of AI technologies, and the scientific community has focused on developing trustworthy AI algorithms. However, the machine learning and deep learning algorithms popular in the AI community today depend heavily on the data used during their development. These learning algorithms identify patterns in the data in order to learn the behavioral objective, so any flaws in the data have the potential to translate directly into the algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance, and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to "datasheets for datasets" with important additions for improved dataset documentation. With governments around the world formalizing data protection laws, the way datasets are created in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.
