DataPerf: Benchmarks for Data-Centric AI Development

Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo, Bilge Acun, Lingjiao Chen, Mehul Smriti Raje, Max Bartolo, Sabri Eyuboglu, Amirata Ghorbani, Emmett Goodman, Oana Inel, Tariq Kane, Christine R. Kirkpatrick, Tzu-Sheng Kuo, Jonas Mueller, Tristan Thrush, Joaquin Vanschoren, Margaret Warren, Adina Williams, Serena Yeung, Newsha Ardalani, Praveen Paritosh, Lilith Bath-Leah, Ce Zhang, James Zou, Carole-Jean Wu, Cody Coleman, Andrew Ng, Peter Mattson, Vijay Janapa Reddi

Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
