Existing large audio-language models perceive the world as "mono": a single stream of audio that ignores the critical spatial dimension ("where") required for universal audio scene analysis (ASA). To bridge this gap, we first introduce a hierarchical framework for ASA. Guided by this framework, we present a system that enables large audio-language models (LALMs) to understand and reason about the complex acoustic world. Our system endows LALMs with universal spatial understanding through four key innovations: (1) a scalable simulation pipeline that synthesizes high-quality First-Order Ambisonics (FOA) data; (2) a unified model framework that integrates universal spatial encoding with a dense hybrid projection mechanism to bridge the modality gap; (3) a progressive training curriculum that evolves from representation alignment to reinforcement-learning-based reasoning; and (4) a comprehensive ASA benchmark designed to rigorously evaluate atomic perception, relational integration, and cognitive reasoning capabilities, on which our model demonstrates comparatively strong spatial understanding. Our work provides a clear pathway for leveraging the powerful reasoning abilities of LALMs toward holistic ASA, advancing from "mono" semantic recognition to spatial intelligence.