Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Several popular dataset repositories already support Croissant, covering hundreds of thousands of datasets that can be loaded easily into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.