Large Audio Language Models (LALMs) achieve strong performance on a variety of audio understanding tasks but continue to struggle with temporal reasoning, a capability central to human auditory perception. Understanding the causes of these failures remains challenging because existing benchmarks report performance gaps without probing the underlying mechanisms. To address this, we introduce a benchmark of 1,657 questions across three foundational tasks designed specifically for mechanistic analysis. A behavioral analysis, which examines model outputs across varying input settings, reveals that models often under-utilize audio when textual cues are available. We also provide the first causal mechanistic analysis of temporal reasoning failures in LALMs. Comparing attention upweighting against scaling, we find that redistributing attention across audio tokens is more effective than simply increasing audio attention, and that targeting task-relevant tokens yields further gains. These findings suggest that modality imbalance alone cannot explain the failures. Attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1% without fine-tuning, a promising direction for future work.
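The contrast between increasing audio attention and redistributing attention across audio tokens can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions, not the paper's procedure: it operates on a single post-softmax attention row, and the function names (`increase_audio_mass`, `redistribute_within_audio`), the factors `alpha` and `beta`, and the masks are all hypothetical. The abstract does not specify how the interventions are actually defined or at which layers and heads they are applied.

```python
import numpy as np


def increase_audio_mass(attn, audio_mask, alpha):
    """Multiply attention on all audio tokens by alpha, then renormalize the row.

    Hypothetical helper: the total attention on audio grows at the expense of text.
    """
    attn = np.asarray(attn, dtype=float)
    audio_mask = np.asarray(audio_mask, dtype=bool)
    scaled = np.where(audio_mask, attn * alpha, attn)
    return scaled / scaled.sum()


def redistribute_within_audio(attn, audio_mask, target_mask, beta):
    """Boost selected audio tokens by beta while keeping total audio attention fixed.

    Hypothetical helper: text-token weights are left untouched; only the split
    of attention among audio tokens changes.
    """
    attn = np.asarray(attn, dtype=float)
    audio_mask = np.asarray(audio_mask, dtype=bool)
    target_mask = np.asarray(target_mask, dtype=bool)

    audio_total = attn[audio_mask].sum()
    boosted = np.where(audio_mask & target_mask, attn * beta, attn)

    out = attn.copy()
    # Rescale the audio entries so their sum equals the original audio total.
    out[audio_mask] = boosted[audio_mask] * audio_total / boosted[audio_mask].sum()
    return out


# Toy attention row: positions 1-3 are audio tokens, 0 and 4 are text tokens.
attn = np.array([0.4, 0.2, 0.1, 0.1, 0.2])
audio_mask = np.array([False, True, True, True, False])
target_mask = np.array([False, False, True, False, False])  # assumed task-relevant token

more_audio = increase_audio_mass(attn, audio_mask, alpha=2.0)
reshaped = redistribute_within_audio(attn, audio_mask, target_mask, beta=3.0)

print("audio mass, original:     ", attn[audio_mask].sum())
print("audio mass, increased:    ", more_audio[audio_mask].sum())
print("audio mass, redistributed:", reshaped[audio_mask].sum())
```

In this toy setup the first intervention changes the balance between modalities, while the second leaves that balance unchanged and only alters which audio tokens receive attention. That distinction is what makes it possible to separate modality imbalance from misallocation within the audio stream.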