Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

The rapid entry of machine learning approaches into our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and datasets collected for one purpose or application are often reused for a different problem. Additionally, dataset annotations may not remain representative over time, may be ambiguous or erroneous, or may fail to generalize across issues or domains. Recent research has shown that these practices might lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed in a responsible manner, where the quality of the data is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide data collection with a set of metrics for an iterative, in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform on the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach impacts the assessment of the robustness of data used for AI applied in the real world, where diversity of users and content is prominent. Furthermore, it addresses fairness and accountability aspects of data collection by providing systematic and transparent quality analysis for data collections.
