Although Audio Large Language Models (ALLMs) have advanced substantially, their long-audio understanding capabilities remain unexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving the field without a consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short, middle, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: 1. Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a performance degradation of over 90% in specific tasks. 2. Structural Attention Dilution: Performance degradation stems from a fundamental failure to maintain temporal locality; attention mechanisms suffer from significant diffusion in later sequences. 3. Restorative Ceiling of Mitigation: Current mitigation strategies offer only 50% recovery. These findings reveal significant challenges in long-audio understanding, underscoring the urgent need for approaches that achieve robust, document-level audio reasoning.