The maturation of Large Audio Language Models (LALMs) has raised growing expectations that they comprehend complex audio much as humans do. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework: first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, we develop a structured data generation pipeline to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority on both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.