Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard

From arXiv. 23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker

Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
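The abstract describes generating metadata from a dataset directory through a modular handler registry. A minimal sketch of that general idea follows, assuming one handler per file suffix. Every name here (`HANDLERS`, `register`, `handle_csv`, `describe_directory`) is hypothetical and is not Croissant Baker's actual API; the output is a simplified Croissant-style dictionary with no `@context`, and it is not validated against the Croissant specification.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping a file suffix to a handler function.
# This is an illustration only, not Croissant Baker's real interface.
HANDLERS: Dict[str, Callable[[Path, Path], dict]] = {}


def register(suffix: str):
    """Decorator that adds a handler to the registry under a file suffix."""
    def decorator(func):
        HANDLERS[suffix] = func
        return func
    return decorator


def infer_type(value: str) -> str:
    """Guess a schema.org data type from a single sample value."""
    for caster, name in ((int, "sc:Integer"), (float, "sc:Float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "sc:Text"


@register(".csv")
def handle_csv(path: Path, root: Path) -> dict:
    """Describe one CSV file using its header and first data row only."""
    rel = path.relative_to(root).as_posix()
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        first_row = next(reader, None)
    fields = []
    for i, column in enumerate(header):
        sample = first_row[i] if first_row and i < len(first_row) else ""
        fields.append({
            "@type": "cr:Field",
            "name": column,
            "dataType": infer_type(sample),
            "source": {"fileObject": {"@id": rel}, "extract": {"column": column}},
        })
    file_object = {
        "@type": "cr:FileObject",
        "@id": rel,
        "name": path.name,
        "contentUrl": rel,
        "encodingFormat": "text/csv",
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }
    record_set = {"@type": "cr:RecordSet", "name": path.stem, "field": fields}
    return {"distribution": [file_object], "recordSet": [record_set]}


def describe_directory(root: Path, name: str) -> dict:
    """Walk a local directory and merge the output of every matching handler."""
    metadata = {"@type": "sc:Dataset", "name": name, "distribution": [], "recordSet": []}
    for path in sorted(root.rglob("*")):
        handler = HANDLERS.get(path.suffix.lower())
        if path.is_file() and handler is not None:
            part = handler(path, root)
            metadata["distribution"] += part["distribution"]
            metadata["recordSet"] += part["recordSet"]
    return metadata


if __name__ == "__main__":
    # Demo on a throwaway directory; the data never leaves the local machine.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "patients.csv").write_text("id,weight,label\n1,2.5,cat\n", encoding="utf-8")
        print(json.dumps(describe_directory(Path(tmp), "demo"), indent=2))
```

Supporting another format in this sketch would mean registering one more function under its suffix (for example a Parquet handler), leaving the directory walk unchanged. Files with unregistered suffixes are skipped silently here; a real tool would likely report them.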
