The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, existing audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline based on a series of public tools or APIs, and use it to construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.