JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation with JAEGER, a framework that extends AV-LLMs to 3D space and enables joint spatial grounding and reasoning by integrating RGB-D observations with multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to improve direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
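
For readers unfamiliar with intensity vectors, the sketch below shows the classical, non-learned acoustic intensity vector computed from first-order ambisonics and used for direction-of-arrival estimation. It is not the Neural IV described above, whose details the abstract does not give. The W, X, Y, Z channel order, the unit-gain encoding, the single noise-free source, and the function names `encode_foa` and `intensity_vector_doa` are all assumptions made for illustration.

```python
import numpy as np


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as first-order ambisonics (assumed order: W, X, Y, Z).

    Angles are in radians. Gains follow the plain plane-wave convention; a
    different W normalization would rescale the vector but not its direction.
    """
    w = signal
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])


def intensity_vector_doa(foa):
    """Estimate (azimuth, elevation) in radians from a time-averaged intensity vector."""
    w, x, y, z = foa
    # Time-averaged product of the omnidirectional channel with each dipole
    # channel; with this encoding the vector points toward the source.
    iv = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    azimuth = np.arctan2(iv[1], iv[0])
    elevation = np.arctan2(iv[2], np.hypot(iv[0], iv[1]))
    return azimuth, elevation


# Hypothetical test signal: white noise arriving from azimuth 40°, elevation -15°.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
foa = encode_foa(source, np.deg2rad(40.0), np.deg2rad(-15.0))
az, el = intensity_vector_doa(foa)
print(np.rad2deg(az), np.rad2deg(el))
```

With one clean source the estimate matches the encoded direction. With overlapping sources or reverberation, a single time-averaged vector mixes the contributions of all arrivals, which is the kind of adverse scenario the abstract says the learned representation targets.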

