Despite recent improvements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset consisting of over 13,000 hours of text-labeled audio, aided by large language model (LLM) processing and audio captioning. Further, we employ a masked autoencoder (MAE) pre-pretraining phase with random patch dropout, which allows us to both scale unlabeled audio datasets and train efficiently with variable-length audio. After MAE pre-pretraining of our audio encoder, we train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks and exhibits competitive results on other downstream tasks such as zero-shot classification.
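As an illustration of the random patch dropout mentioned above, the following is a minimal sketch and not the implementation used for Cacophony. It assumes a clip has already been split into a sequence of spectrogram patches. The function name `random_patch_dropout` and the `keep_ratio` value of 0.25 are hypothetical choices made for this example. The idea is that the encoder only processes the patches that are kept, so longer clips need not cost proportionally more compute.

```python
import random


def random_patch_dropout(patches, keep_ratio=0.25, rng=None):
    """Keep a random subset of patches, preserving their original order.

    Returns (kept_patches, kept_indices). The indices record where each
    kept patch came from, e.g. for positional embeddings.
    NOTE: keep_ratio=0.25 is an illustrative assumption, not a value
    taken from the paper.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not patches:
        return [], []
    rng = rng or random.Random()
    # Always keep at least one patch so the encoder input is never empty.
    num_keep = max(1, round(len(patches) * keep_ratio))
    # Sample without replacement, then sort to preserve temporal order.
    kept_indices = sorted(rng.sample(range(len(patches)), num_keep))
    kept_patches = [patches[i] for i in kept_indices]
    return kept_patches, kept_indices


if __name__ == "__main__":
    # Stand-in for a clip split into 40 patches (strings used as placeholders).
    clip = [f"patch_{i}" for i in range(40)]
    kept, idx = random_patch_dropout(clip, keep_ratio=0.25, rng=random.Random(0))
    print(len(kept), idx)
```

Because the number of kept patches scales with the clip length, clips of different durations can be handled by the same routine. How the kept patches are batched and padded is left open here.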