Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning that progresses from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs that combine binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmarks, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$\% over BAT.
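To make the training-time alignment idea concrete, here is a minimal, illustrative sketch of a geometry-aware contrastive objective that couples binaural-audio embeddings with geometry embeddings derived from panoramic depth and room impulse responses. The encoders, embedding dimensions, and the symmetric InfoNCE loss are our assumptions, not details confirmed by the abstract; the only grounded idea is that geometry supervises the audio encoder during training, while inference uses audio alone.

```python
# Hypothetical SAGE-style alignment loss (sketch only; not the paper's actual objective).
import torch
import torch.nn.functional as F

def alignment_loss(audio_feat: torch.Tensor,
                   geom_feat: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment between audio and geometry embeddings.

    audio_feat: (B, D) embeddings from a binaural audio encoder.
    geom_feat:  (B, D) embeddings from panoramic depth + RIR for the same scenes.
    """
    a = F.normalize(audio_feat, dim=-1)
    g = F.normalize(geom_feat, dim=-1)
    logits = a @ g.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    # Symmetric InfoNCE: audio -> geometry and geometry -> audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

At inference, only the audio branch would be retained, matching the abstract's claim that SAGE requires no depth images or impulse responses at test time.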
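The o'clock-level azimuth readout and the mean DoA error metric reduce to simple circular arithmetic. The sketch below assumes a convention of 0° at 12 o'clock with clockwise-positive azimuth and 30° sectors; the paper's exact conventions may differ.

```python
def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Map an azimuth to a clock position in {1, ..., 12} (30-degree sectors).

    Assumed convention (not from the paper): 0 deg = front = 12 o'clock,
    angles increase clockwise.
    """
    sector = int(((azimuth_deg % 360.0) + 15.0) // 30.0) % 12
    return 12 if sector == 0 else sector

def doa_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute angular difference on the circle, in degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

# Quick checks of the circular arithmetic.
assert azimuth_to_oclock(100.0) == 3   # 100 deg falls in the 3 o'clock sector
assert doa_error_deg(350.0, 10.0) == 20.0  # wrap-around is handled correctly
```

Averaging `doa_error_deg` over a test set yields a mean DoA error of the kind the abstract reports (the 11° reduction attributed to SAGE).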