Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark of 2,000 verified examples spanning 10 paralinguistic tasks. The examples are created with controlled speech synthesis so that what the transcript claims intentionally mismatches how the speech is spoken, enabling direct measurement of paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy against the acoustic ground truth and a strong tendency to follow the language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in the audio tokens, the language model frequently ignores them. To address these problems, we propose the Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. Together, these methods substantially improve Audio LLM paralinguistic understanding, raising the accuracy of Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
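
As a rough illustration of the two ideas named above, the following sketch shows one way a prompt-conditioned mixture over encoder layers and a DPO-style preference loss could be written. The abstract does not specify the PCLM architecture, so the softmax gate, the gating matrix W, the tensor shapes, and every function name here are assumptions made for illustration, not the authors' implementation. The loss term follows the standard DPO objective, with the acoustically supported answer as the chosen option and the language-implied answer as the rejected one.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def mix_layers(layer_feats, prompt_emb, W):
    """Blend per-layer audio features with prompt-dependent weights.

    layer_feats: (L, T, D) features from L encoder layers (assumed layout).
    prompt_emb:  (P,) pooled embedding of the text prompt (assumed).
    W:           (L, P) hypothetical gating matrix, one row per layer.
    Returns the mixed (T, D) features and the (L,) layer weights.
    """
    weights = softmax(W @ prompt_emb)                   # (L,), sums to 1
    mixed = np.tensordot(weights, layer_feats, axes=1)  # weighted sum over layers
    return mixed, weights


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    The arguments are sequence log-probabilities under the trained policy
    and under a frozen reference model. Computes -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return np.logaddexp(0.0, -margin)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 5, 8))   # 4 layers, 5 frames, 8 dims
    prompt = rng.normal(size=(16,))
    gate = rng.normal(size=(4, 16))
    mixed, w = mix_layers(feats, prompt, gate)
    print(mixed.shape, w.sum())
    print(dpo_loss(-2.0, -5.0, -3.0, -3.0))
```

In this sketch the layer weights depend only on the prompt, so a question about emotion and a question about content could draw on different encoder depths. A per-frame or per-token gate would be an equally plausible design.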
