Despite recent advances in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we build a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to clean noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. Training proceeds in two stages. We first train on audio-only data with a masked autoencoder (MAE) objective, which lets us benefit from the scalability of unlabeled audio datasets. We then initialize our audio encoder from the MAE model and train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks and shows competitive results on the HEAR benchmark and on other downstream tasks such as zero-shot classification.
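To make the second training stage concrete, the following is a minimal sketch of how a contrastive loss and an auxiliary captioning loss might be combined. It is not the authors' implementation: the symmetric InfoNCE (CLIP-style) form of the contrastive term, the temperature of 0.07, the `caption_weight` parameter, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumed form: row i of audio_emb is the positive match for row i of
    text_emb, and every other row in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: softmax over each row.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def captioning_loss(token_logits, target_ids):
    """Mean token-level cross-entropy for a caption decoder's outputs."""
    per_token = [-log_softmax(row)[tid] for row, tid in zip(token_logits, target_ids)]
    return sum(per_token) / len(per_token)


def total_loss(audio_emb, text_emb, token_logits, target_ids,
               caption_weight=1.0, temperature=0.07):
    """Contrastive term plus a weighted captioning term (weight is hypothetical)."""
    return (contrastive_loss(audio_emb, text_emb, temperature)
            + caption_weight * captioning_loss(token_logits, target_ids))
```

In this sketch the contrastive term is small when each audio embedding is closest to its own caption embedding and large when the pairs are mismatched, while the captioning term penalizes a decoder for assigning low probability to the reference caption tokens. In a real training loop the embeddings and token logits would come from the MAE-initialized audio encoder, a text encoder, and a caption decoder.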