When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (−23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
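As a rough illustration of the headline comparison, the sketch below computes a text dominance rate as the fraction of conflict trials in which the model's answer followed the text source, then takes the ratio between the two conditions. The abstract does not specify the scoring procedure, so the function name `text_dominance_rate`, the boolean per-trial encoding, and the toy trial counts are all assumptions made for illustration; only the 16.6% and 1.6% figures come from the text.

```python
def text_dominance_rate(followed_text):
    """Fraction of conflict trials whose answer followed the text source.

    `followed_text` is a hypothetical per-trial encoding: True if the
    model's answer matched the text source, False otherwise.
    """
    if not followed_text:
        raise ValueError("need at least one trial")
    return sum(followed_text) / len(followed_text)


# Toy trials (not ALME data), sized to reproduce the reported 16.6% rate.
toy_audio_text_trials = [True] * 166 + [False] * 834
toy_rate = text_dominance_rate(toy_audio_text_trials)

# Ratio of the two rates reported in the abstract, in percent.
audio_text_pct = 16.6  # text dominance under audio-text conflict
text_text_pct = 1.6    # text dominance under text-text conflict
ratio = audio_text_pct / text_text_pct

print(f"toy rate: {toy_rate:.3f}")
print(f"ratio: {ratio:.1f}x")
```

The ratio of the two reported rates is about 10.4, which is consistent with the "10 times more often" summary.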
