Existing large audio-language models perceive the world in "mono": a single stream of audio that ignores the critical spatial dimension ("where") required for universal acoustic scene analysis. To bridge this gap, we first propose a hierarchical framework for Auditory Scene Analysis (ASA). Guided by this framework, we build a system that enables models like Qwen2-Audio to understand and reason about the complex acoustic world. The system rests on three core contributions. First, we build a large-scale, synthesized binaural audio dataset that provides rich spatial cues. Second, we design a hybrid feature projector that leverages parallel semantic and spatial encoders to extract decoupled representations; these distinct streams are integrated via a dense fusion mechanism, giving the model a holistic view of the acoustic scene. Finally, we employ a progressive training curriculum, advancing from supervised fine-tuning (SFT) to reinforcement learning via Group Relative Policy Optimization (GRPO), to explicitly develop the model's reasoning capabilities. On our comprehensive benchmark, the model demonstrates comparatively strong spatial understanding. By enabling this spatial perception, our work provides a clear pathway for bringing the powerful reasoning abilities of large models to bear on holistic acoustic scene analysis, advancing from "mono" semantic recognition to spatial intelligence.
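To make the hybrid feature projector concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: the module names, feature dimensions, and the concatenation-plus-MLP "dense fusion" are all assumptions. It only illustrates the stated idea of keeping semantic and spatial streams decoupled until a fusion step that produces tokens for the language model.

```python
import torch
import torch.nn as nn


class HybridFeatureProjector(nn.Module):
    """Illustrative sketch (assumed design): parallel semantic and spatial
    streams are projected separately, then densely fused into tokens of the
    language model's embedding dimension."""

    def __init__(self, sem_dim=1280, spa_dim=512, llm_dim=4096):
        super().__init__()
        # Per-stream projections; the upstream encoders (e.g., a semantic
        # audio encoder and a binaural spatial-cue encoder) are not shown.
        self.semantic_proj = nn.Linear(sem_dim, llm_dim)
        self.spatial_proj = nn.Linear(spa_dim, llm_dim)
        # Dense fusion, assumed here to be concatenation followed by an MLP.
        self.fusion = nn.Sequential(
            nn.Linear(2 * llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sem_feats, spa_feats):
        # sem_feats: (B, T, sem_dim) semantic-encoder features
        # spa_feats: (B, T, spa_dim) spatial-encoder features
        sem = self.semantic_proj(sem_feats)
        spa = self.spatial_proj(spa_feats)
        fused = self.fusion(torch.cat([sem, spa], dim=-1))
        return fused  # (B, T, llm_dim) tokens fed to the language model


# Usage with random tensors standing in for encoder outputs.
projector = HybridFeatureProjector()
tokens = projector(torch.randn(2, 100, 1280), torch.randn(2, 100, 512))
print(tokens.shape)  # torch.Size([2, 100, 4096])
```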