VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

While large audio language models (LALMs) have made remarkable progress on audio at the scale of seconds or minutes, understanding hour-long audio remains a fundamental bottleneck. Existing benchmarks rely predominantly on short clips or artificially concatenated segments, and therefore fail to faithfully assess how well LALMs comprehend long-range information in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets organized into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. The results yield three main findings. First, VoiceGiraffe remains highly challenging and far from saturated. Second, no single inference paradigm universally dominates: end-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models that are overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, long-range memory persistence emerges as a key bottleneck: LALMs are better at answering questions that require connecting salient causal cues than questions that require sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, and highlight the need for LALMs with persistent memory and robust long-range aggregation.
