Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation with JAEGER, a framework that extends AV-LLMs to 3D space, enabling joint spatial grounding and reasoning by integrating RGB-D observations with multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues and improves direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
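For readers unfamiliar with intensity-vector features, the sketch below shows the classical, hand-crafted acoustic intensity vector that can be computed from first-order ambisonics and used for direction-of-arrival estimation. It is not JAEGER's Neural IV, whose details are not given in this abstract; it only illustrates the kind of directional cue a learned representation could improve on. The encoding convention (W, X, Y, Z without normalisation gains), the time-domain formulation, and the helper names `encode_foa`, `mix` and `intensity_doa` are assumptions made for this example.

```python
import math
import random


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as an ideal plane wave in first-order ambisonics.

    Assumed convention (no channel normalisation gains):
    W = s, X = s*cos(az)*cos(el), Y = s*sin(az)*cos(el), Z = s*sin(el).
    Returns a list of (w, x, y, z) samples.
    """
    gx = math.cos(azimuth) * math.cos(elevation)
    gy = math.sin(azimuth) * math.cos(elevation)
    gz = math.sin(elevation)
    return [(s, s * gx, s * gy, s * gz) for s in signal]


def mix(*scenes):
    """Sum several FOA scenes sample by sample to model overlapping sources."""
    return [tuple(sum(ch) for ch in zip(*frames)) for frames in zip(*scenes)]


def intensity_doa(foa):
    """Estimate (azimuth, elevation) in radians from a time-domain intensity vector.

    The intensity vector is the time-summed product of the omnidirectional
    channel W with each dipole channel X, Y, Z.
    """
    ix = iy = iz = 0.0
    for w, x, y, z in foa:
        ix += w * x
        iy += w * y
        iz += w * z
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Hypothetical test signal: white noise arriving from azimuth 60°, elevation 20°.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4000)]
az, el = intensity_doa(encode_foa(source, math.radians(60), math.radians(20)))
print(round(math.degrees(az), 3), round(math.degrees(el), 3))
```

For a single noise-free plane wave, this estimator recovers the encoded angles. When two sources overlap (for example, `mix(encode_foa(a, 0.0, 0.0), encode_foa(b, math.radians(90), 0.0))` with independent signals `a` and `b`), the summed vector points somewhere between them rather than at either one, which is the kind of adverse scenario the abstract refers to.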