In the era of artificial intelligence, the diversity of data modalities and annotation formats often means that a dataset cannot be used directly: researchers and developers with different needs must first understand its format and convert it. To tackle this problem, this article introduces the Dataset Description Language (DSDL), a framework that aims to simplify dataset processing by providing a unified standard for AI datasets. DSDL follows three basic practical principles: it is generic, portable, and extensible. It uses a unified standard to express data of different modalities and structures, which facilitates the dissemination of AI data, and it can be easily extended to new modalities and tasks. The standardized specifications of DSDL reduce the workload for users in data dissemination, processing, and usage. To further improve user convenience, we provide predefined DSDL templates for various tasks, convert mainstream datasets to comply with the DSDL specifications, and provide comprehensive documentation and DSDL tools. These efforts aim to simplify the use of AI data and thereby improve the efficiency of AI development.