Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often make perceptual errors, and reliable audio reasoning is unattainable unless the model's perception is first grounded in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce the Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes between multiple speakers, providing explicit perceptual reasoning for training. Building on this dataset, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we fine-tune the model on PAQA so that it learns to perceive acoustic attributes in complex audio. In Stage II, we use Group Relative Policy Optimization (GRPO) to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases, and we design a perceptual consistency reward to align reasoning rationales with the raw audio. Experiments across benchmarks show that HyPeR achieves absolute improvements over the base model and performance comparable to that of large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning for robust, multi-speaker audio understanding.
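As a rough illustration of the Stage II idea, the sketch below shows how GRPO scores each sampled response relative to the other responses drawn for the same prompt. The abstract does not specify how the perceptual consistency reward is computed or how it is combined with answer correctness, so `combined_reward`, its `weight` parameter, and the example scores are hypothetical placeholders, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, perceptual_consistency, weight=0.5):
    """Hypothetical reward: answer correctness plus a weighted consistency score.

    The weighting scheme is an assumption made for illustration only; the
    abstract does not define how the two signals are combined.
    """
    return float(answer_correct) + weight * perceptual_consistency


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one audio question:
# (answer correct?, hypothetical perceptual consistency score in [0, 1])
samples = [(True, 0.9), (False, 0.4), (True, 0.2), (False, 0.1)]
rewards = [combined_reward(c, p) for c, p in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that is both correct and perceptually consistent receives the largest advantage in its group, while a group in which all responses earn the same reward yields zero advantage for every response.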