Recent Audio Multimodal Large Language Models (Audio MLLMs) achieve impressive performance on speech benchmarks, yet it remains unclear whether they genuinely process acoustic signals or instead rely on text-based semantic inference. To study this question systematically, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. We then design a controlled multi-level evaluation framework that progressively increases textual influence, from semantic conflicts in the content, to misleading prompts, to the combination of both, which allows us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics that quantify how strongly a model relies on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: the models are sensitive to acoustic variations, yet their predictions are driven predominantly by textual inputs. This exposes a gap between high performance on standard speech benchmarks and genuine acoustic understanding.