Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of available datasets is growing rapidly. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a simple, standardized user interface for publishing, maintaining, and accessing the annotations and audio files of a dataset. To store the data efficiently on a server, audb automatically resolves dependencies between versions of a dataset and uploads only newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access. audb is lightweight and can be used from any machine learning library. It supports the management of datasets on a single PC, within a university or company, or across a whole research community.
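The idea of uploading only newly added or altered files can be illustrated with a minimal sketch. It is not audb's actual implementation: the helper names `checksum` and `files_to_upload`, the file paths, and the choice of SHA-256 are assumptions made for illustration only. Each version is modeled as a mapping from file path to content hash, and a file is scheduled for upload when its path is new or its hash differs from the previous version.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content hash used to detect altered files (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


def files_to_upload(previous: dict, current: dict) -> list:
    """Return the paths in `current` that are new or changed relative to `previous`.

    Both arguments map a file path to its content hash.
    """
    return sorted(
        path for path, digest in current.items()
        if previous.get(path) != digest
    )


# Version 1 of an illustrative dataset: two audio files and one annotation table.
v1 = {
    "audio/001.wav": checksum(b"speech-a"),
    "audio/002.wav": checksum(b"speech-b"),
    "db.emotion.csv": checksum(b"file,emotion\n"),
}

# Version 2: one file unchanged, one altered, one added, table unchanged.
v2 = {
    "audio/001.wav": checksum(b"speech-a"),        # unchanged
    "audio/002.wav": checksum(b"speech-b-fixed"),  # altered
    "audio/003.wav": checksum(b"speech-c"),        # newly added
    "db.emotion.csv": checksum(b"file,emotion\n"),  # unchanged
}

changed = files_to_upload(v1, v2)
print(changed)  # ['audio/002.wav', 'audio/003.wav']
```

In this sketch, unchanged files are never transferred again; a new version only needs to reference the copies already stored for an earlier version, which is the storage saving the abstract describes.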