While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues through their strong language-based mathematical reasoning while jointly analyzing 2D visual features in a tightly coupled manner. We present a simple yet effective approach based on GR3D that requires no additional training and is readily applicable to different MLLMs. Applied in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and by more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning from highly sparse input views.
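To make the representation concrete, the sketch below illustrates one plausible way to serialize ID-indexed 3D attributes into text that can accompany the annotated images in an MLLM prompt. This is a minimal illustration, not the paper's implementation: the attribute set (center, size), units, and formatting are assumptions.

```python
# A minimal sketch of a GR3D-style textual scene encoding (illustrative only;
# the field names, attribute set, and formatting are assumptions, not the
# authors' specification).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedObject:
    obj_id: int                           # unique ID drawn on the image
    category: str                         # e.g. "sofa"
    center: Tuple[float, float, float]    # 3D centroid in meters, scene frame
    size: Tuple[float, float, float]      # bounding-box extents (w, h, d) in meters


def encode_gr3d(objects: List[AnnotatedObject]) -> str:
    """Serialize per-object 3D attributes as textual references indexed by ID."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        w, h, d = o.size
        lines.append(
            f"[{o.obj_id}] {o.category}: center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
            f"size=({w:.2f} x {h:.2f} x {d:.2f}) m"
        )
    return "\n".join(lines)


# Example: the resulting text would be appended to the prompt alongside the
# images in which objects 1 and 2 are visually marked with the same IDs.
scene = [
    AnnotatedObject(1, "sofa", (1.20, 0.40, 2.10), (1.80, 0.85, 0.90)),
    AnnotatedObject(2, "table", (2.65, 0.35, 1.30), (1.10, 0.70, 0.60)),
]
print(encode_gr3d(scene))
```

Because the geometric attributes arrive as plain text keyed to visible IDs, the model can reason over distances and sizes with ordinary arithmetic while grounding each ID in the corresponding image region.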