Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
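The fusion path described above (aggregate intermediate CED layers, align them to the Whisper timeline, then add the two streams without changing sequence length) can be illustrated with a minimal pure-Python sketch. Several details here are assumptions, not taken from the abstract: the weighted average used to aggregate layers, the linear interpolation used for time alignment, and the premise that both streams already share one feature dimension (any projection layer is omitted). The function names `aggregate_layers`, `align_to_timeline`, and `fuse` are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # indexed as [time][feature]


def aggregate_layers(layers: List[Frames], weights: List[float]) -> Frames:
    """Weighted average of same-shaped intermediate layer outputs.

    Assumption: the abstract does not say how layers are aggregated;
    a normalized weighted sum is used here purely for illustration.
    """
    total = sum(weights)
    n_frames, dim = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layers)) / total
            for d in range(dim)
        ]
        for t in range(n_frames)
    ]


def align_to_timeline(frames: Frames, target_len: int) -> Frames:
    """Resample along time to target_len frames.

    Assumption: linear interpolation stands in for whatever alignment
    the actual model uses.
    """
    src_len = len(frames)
    if src_len == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (src_len - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])])
    return out


def fuse(whisper: Frames, ced_layers: List[Frames], weights: List[float]) -> Frames:
    """Add the aligned CED stream to the Whisper stream, frame by frame.

    The output has exactly len(whisper) frames, so the sequence length
    seen by the language model is unchanged (non-compressive fusion).
    """
    ced = aggregate_layers(ced_layers, weights)
    ced = align_to_timeline(ced, len(whisper))
    return [[w + c for w, c in zip(wf, cf)] for wf, cf in zip(whisper, ced)]


if __name__ == "__main__":
    whisper = [[10.0, 10.0] for _ in range(5)]       # 5 Whisper frames
    ced_a = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]     # 3 CED frames, layer A
    ced_b = [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]     # 3 CED frames, layer B
    fused = fuse(whisper, [ced_a, ced_b], [1.0, 1.0])
    print(len(fused), [frame[0] for frame in fused])
```

Because the CED stream is resampled to the Whisper frame count and merged by addition instead of concatenation, the fused sequence keeps the same length as the Whisper sequence, which is the property the abstract refers to as non-compressive, time-aligned fusion.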