Whisper has become the de facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models such as CLAP have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST) and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token and is trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe, applied across three model sizes and combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also achieves competitive performance on AIR-Bench MCQs and zero-shot classification.
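The two mechanisms named above can be illustrated with a minimal sketch: a learnable global token pools the 1500 Whisper frame features into a single clip embedding, and Matryoshka-style supervision means a prefix of that embedding (here the first 1/8 of the dimensions, a hypothetical split) remains a usable embedding on its own. All shapes and the attention-pooling form are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                # hypothetical embedding dimension
frames = rng.normal(size=(1500, d))    # stand-in for Whisper encoder outputs (30 s clip)
g = rng.normal(size=(d,))              # learnable global token (randomly initialized here)

# Global token attends over frame features (single-head attention pooling sketch).
scores = frames @ g / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()                           # softmax attention weights over 1500 frames
clip_emb = w @ frames                  # (d,) clip-level embedding

# Matryoshka-style truncation: the leading d/8 dims serve as an 8x smaller embedding.
small = clip_emb[: d // 8]
small = small / np.linalg.norm(small)  # re-normalize the truncated embedding
```

In training, both the full and truncated embeddings would receive the contrastive loss against text embeddings, which is what lets the prefix dimensions stay informative after truncation.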