We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, under differing terminology; we bring some of this work together, particularly from the fields of computer vision and language, and build on it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals, and in gaining better control over what modern ML systems will learn. We conclude with a discussion of the many avenues for future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.