Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.