On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms

Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, serious concerns have been raised over the trustworthiness of AI technologies, and the scientific community has focused on developing trustworthy AI algorithms. However, the machine learning and deep learning algorithms popular in the AI community today depend heavily on the data used during their development. These algorithms learn their behavioral objective by identifying patterns in the data, so any flaw in the data can translate directly into the algorithm. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework that evaluates datasets against a responsible rubric. While existing work focuses on post-hoc evaluation of algorithms for trustworthiness, our framework considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance, and provide recommendations for constructing future datasets. After surveying over 100 datasets, we analyze 60 of them and demonstrate that none is immune to issues of fairness, privacy preservation, and regulatory compliance. We propose modifications to "datasheets for datasets", with important additions for improved dataset documentation. With governments around the world strengthening data protection laws, the way the scientific community creates datasets requires revision. We believe this study is timely and relevant in today's era of AI.
