Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long-audio understanding capabilities remain underexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving no consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored to long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short, middle, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: (1) Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a staggering performance degradation of over 90% on specific tasks. (2) Structural Attention Dilution: this degradation stems from a fundamental failure to maintain temporal locality, with attention becoming significantly diffuse over later portions of the sequence. (3) Restorative Ceiling of Mitigation: current mitigation strategies recover only about 50% of the lost performance. These findings reveal significant challenges in long-audio understanding and underscore the urgent need for approaches that achieve robust, document-level audio reasoning.