The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in audio representation learning, existing audio-language datasets suffer from limited volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools and APIs, and use it to construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on several downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.