Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space, a cornerstone of spatial intelligence, remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, in which it learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves a state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
