Scaling Audio-Text Retrieval with Multimodal Large Language Models

Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. In particular, their text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. We make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio or text input and using the hidden state of a special token as the corresponding embedding, and we train the model with a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while using only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends with respect to dataset size and model capacity, validating the effectiveness of MLLMs as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
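To make the training objective in contribution (ii) more concrete, the following is a minimal sketch of a contrastive loss with hard-negative reweighting, averaged over several caption granularities. The abstract does not give the Hybrid-NCE formula, so this is a generic illustration rather than the authors' loss: the function names, the `temperature` and `beta` parameters, the specific weighting scheme (negatives are up-weighted in proportion to `exp(beta * similarity)`, with weights normalized to a mean of one), and the uniform average over granularities are all assumptions. The embeddings are taken as given; in AuroLA they would come from the hidden state of a special token in the MLLM.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; entry [i][j] pairs audio i with text j."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]

def reweighted_nce(logits, beta=0.0):
    """One retrieval direction: the positive for row i is column i.

    Negatives are weighted by exp(beta * logit), normalized so the weights
    average to 1. With beta = 0 this reduces to standard InfoNCE.
    (Assumed weighting scheme, not taken from the paper.)
    """
    n = len(logits)
    if n < 2:
        raise ValueError("need at least one negative per row")
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        negs = [row[j] for j in range(n) if j != i]
        w_max = max(beta * s for s in negs)
        raw = [math.exp(beta * s - w_max) for s in negs]
        scale = (n - 1) / sum(raw)  # makes the weights average to 1
        neg_mass = sum(scale * w * math.exp(s - m) for w, s in zip(raw, negs))
        denom = math.exp(row[i] - m) + neg_mass
        total += -(row[i] - m - math.log(denom))
    return total / n

def symmetric_loss(audio_embs, text_embs, temperature=0.07, beta=0.0):
    """Average of the audio-to-text and text-to-audio losses."""
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (reweighted_nce(logits, beta) + reweighted_nce(transposed, beta))

def multi_granular_loss(audio_embs, text_embs_by_granularity, temperature=0.07, beta=0.0):
    """Uniform average over caption granularities, e.g. {"long": ..., "tags": ...}."""
    losses = [symmetric_loss(audio_embs, t, temperature, beta)
              for t in text_embs_by_granularity.values()]
    return sum(losses) / len(losses)
```

In this sketch, a larger `beta` shifts more of the denominator's mass onto negatives that are most similar to the query, so confusable pairs contribute more to the gradient. A call such as `multi_granular_loss(audio, {"long": long_embs, "tags": tag_embs}, beta=1.0)` would combine supervision from two caption types for the same batch of audio clips.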
