Reasoning in Space via Grounding in the World

In this paper, we argue that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore effective spatial representations that bridge the gap between the two. Existing 3D LLMs lack a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency manifests either as poor grounding performance or as excessive reliance on external modules, and it ultimately hinders the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM to achieve autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is curated to include both 3D bounding box annotations for the objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves strong results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
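
The abstract does not spell out how the dual-path pooling is computed, so the following is only an illustrative sketch of the general idea of folding geometric and positional information into existing image-patch tokens without adding tokens. It should not be read as the GS-Reasoner implementation. Everything here is an assumption made for illustration: the function names (`patch_mean_pool`, `sinusoidal_embed`, `fuse_patch_tokens`), the use of plain average pooling for both paths, the sinusoidal encoding of pooled 3D coordinates, and the additive fusion.

```python
import numpy as np

def patch_mean_pool(x, patch):
    """Average-pool an (H, W, C) map over non-overlapping patch x patch windows.

    Returns an array of shape (H // patch * W // patch, C), one row per patch.
    """
    h, w, c = x.shape
    hp, wp = h // patch, w // patch
    x = x[:hp * patch, :wp * patch]            # drop any ragged border
    x = x.reshape(hp, patch, wp, patch, c)     # split rows and columns into windows
    return x.mean(axis=(1, 3)).reshape(hp * wp, c)

def sinusoidal_embed(xyz, dim):
    """Encode (P, 3) coordinates as (P, dim) sinusoidal features (hypothetical choice)."""
    assert dim % 6 == 0, "dim must be divisible by 6 (3 axes x sin/cos)"
    n_freq = dim // 6
    freqs = 2.0 ** np.arange(n_freq)                   # (n_freq,)
    angles = xyz[:, :, None] * freqs                   # (P, 3, n_freq)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(xyz.shape[0], dim)

def fuse_patch_tokens(sem_tokens, geo_feats, xyz, patch):
    """Assumed fusion: add pooled geometry and pooled-position embeddings to each patch token.

    sem_tokens: (P, D) semantic patch tokens from an image encoder
    geo_feats:  (H, W, D) per-pixel geometric features
    xyz:        (H, W, 3) per-pixel 3D coordinates
    """
    geo_pooled = patch_mean_pool(geo_feats, patch)     # path 1: geometry per patch
    xyz_pooled = patch_mean_pool(xyz, patch)           # path 2: position per patch
    pos = sinusoidal_embed(xyz_pooled, sem_tokens.shape[1])
    return sem_tokens + geo_pooled + pos               # still P tokens of width D

# Toy usage with random inputs: a 28x28 image split into four 14x14 patches.
rng = np.random.default_rng(0)
sem = rng.normal(size=(4, 12))
geo = rng.normal(size=(28, 28, 12))
pts = rng.normal(size=(28, 28, 3))
fused = fuse_patch_tokens(sem, geo, pts, patch=14)
print(fused.shape)  # (4, 12)
```

The point of the sketch is the shape bookkeeping: the fused output has the same number of tokens as the semantic input, which is the property the abstract highlights. A real system would likely replace the average pooling and additive fusion with learned components.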
