Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks. However, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, which limits how well they assess practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder that leverages multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs), five formulations that reflect natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose two discrimination metrics (HNSR and TFR) that assess a model's ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves text-to-audio retrieval performance comparable to the state-of-the-art M2D-CLAP while demonstrating clear advantages in two critical areas: (1) markedly stronger text-to-text retrieval (+22% relative improvement) and (2) substantially better hard negative discrimination (+4.3 percentage points HNSR@10, +34.7% relative TFR@10). These results reveal that LLM backbones provide superior semantic understanding of complex queries.
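The reported gains mix two units: an absolute difference in percentage points (HNSR@10) and a relative improvement in percent (text-to-text retrieval, TFR@10). The following minimal sketch shows how the two differ. The function names and the baseline and improved scores are hypothetical, chosen only for illustration; they are not values from the paper.

```python
def absolute_gain_pp(baseline: float, improved: float) -> float:
    """Absolute gain in percentage points; both scores are given in percent."""
    return improved - baseline


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Gain relative to the baseline score, expressed in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical scores in percent (assumption, NOT taken from the paper).
    baseline_score = 50.0
    improved_score = 54.3

    pp = absolute_gain_pp(baseline_score, improved_score)
    rel = relative_gain_pct(baseline_score, improved_score)
    print(f"absolute gain: {pp:.1f} percentage points")
    print(f"relative gain: {rel:.1f}%")
```

The same absolute gain corresponds to a larger relative gain when the baseline is low, so the two kinds of figures are not directly comparable without the underlying scores.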