Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user interface to publish, maintain, and access the annotations and audio files of a dataset. To store data efficiently on a server, audb automatically resolves dependencies between versions of a dataset and uploads only newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access. audb is a lightweight library and can be used from any machine learning library. It supports the management of datasets on a single PC, within a university or company, or within a whole research community. audb is available at https://github.com/audeering/audb.
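The idea of uploading only newly added or altered files when a new version is published can be illustrated with a minimal sketch. This is not audb's implementation or API: the helper names `file_checksum`, `build_manifest`, and `files_to_upload` are hypothetical, and the use of SHA-256 checksums to detect altered files is an assumption made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents (hypothetical helper)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path, relative to root, to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_to_upload(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the files that are new or altered relative to the previous version."""
    return sorted(
        name
        for name, checksum in current.items()
        # A file missing from the previous manifest yields None here,
        # so new files and files with a changed checksum are both selected.
        if previous.get(name) != checksum
    )
```

In this sketch, a new version would call `build_manifest` on the dataset folder, compare the result with the manifest stored for the previous version, and transfer only the files returned by `files_to_upload`. Files removed in the new version are not handled here; a fuller version would also record them.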