Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

The rapid entry of machine learning approaches into our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and datasets collected for a certain purpose or application are often reused for a different problem. Additionally, datasets may not remain representative over time, may contain ambiguous or erroneous annotations, or may fail to generalize across issues or domains. Recent research has shown that these practices might lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed in a responsible manner, where the quality of the data is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide data collection with a set of metrics for an iterative, in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform on the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach informs the robustness assessment of data used for AI applied in the real world, where diversity of users and content is prominent. Furthermore, it addresses fairness and accountability aspects of data collection by providing systematic and transparent quality analysis for data collections.
