Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused indiscriminately, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Rather than mixing features wholesale, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluations show that GeoThinker sets a new state of the art in spatial intelligence, achieving a peak score of 72.6 on VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code is available at https://github.com/Li-Hao-yuan/GeoThinker.
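To make the described mechanism concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-frame semantic tokens and per-frame geometry tokens of matching width, restricts cross-attention to within each frame ("frame-strict") by folding the frame axis into the batch, and scales the injected geometric signal with a learned per-frame gate. The module name `SpatialGroundedFusion`, the gate design, and all tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch (not the released GeoThinker code): frame-strict cross-attention
# where semantic tokens query geometry tokens of the same frame, calibrated by a
# learned per-frame importance gate before residual fusion.
import torch
import torch.nn as nn

class SpatialGroundedFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: semantic tokens are queries; geometry tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Importance gate: one scalar in (0, 1) per frame, predicted from pooled semantics.
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, sem: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        """sem: (B, F, Ns, D) semantic tokens; geo: (B, F, Ng, D) geometry tokens."""
        B, F, Ns, D = sem.shape
        Ng = geo.shape[2]
        # Frame-strict: fold the frame axis into the batch so attention never
        # crosses frame boundaries.
        q = self.norm_q(sem).reshape(B * F, Ns, D)
        kv = self.norm_kv(geo).reshape(B * F, Ng, D)
        attended, _ = self.cross_attn(q, kv, kv)      # (B*F, Ns, D)
        attended = attended.reshape(B, F, Ns, D)
        # Per-frame importance gate from mean-pooled semantic tokens.
        g = self.gate(sem.mean(dim=2))                # (B, F, 1)
        return sem + g.unsqueeze(-1) * attended       # gated residual fusion

# Minimal usage: 2 clips, 8 frames, 196 semantic and 256 geometry tokens of width 1024.
if __name__ == "__main__":
    fuse = SpatialGroundedFusion(dim=1024)
    sem = torch.randn(2, 8, 196, 1024)
    geo = torch.randn(2, 8, 256, 1024)
    print(fuse(sem, geo).shape)  # torch.Size([2, 8, 196, 1024])
```

Folding frames into the batch is one simple way to enforce the frame-strict constraint; the gated residual lets the model suppress geometric evidence for frames it deems irrelevant to the current query.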