Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. Although recent vision-language models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. Using 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D lets the agent actively manipulate the reconstructed scene through camera-based operations and ego/global-view switching, turning spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit substantially from a reinforcement learning (RL) policy that enables the model to select informative viewpoints and operations: with RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.
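To make the idea of camera-based operations on a reconstructed point cloud concrete, the following is a minimal sketch and not the Think3D implementation. It assumes an OpenCV-style pinhole camera (x right, y down, z forward), world-to-camera extrinsics of the form x_cam = R_wc x_world + t_wc, and a world frame whose up direction is -y. The function names `project`, `rotate_camera`, and `top_down_camera` are hypothetical and do not come from the released code.

```python
import numpy as np

def project(points_world, R_wc, t_wc, K):
    """Project (N, 3) world points to pixels; returns (uv, depth) for points in front of the camera."""
    cam = points_world @ R_wc.T + t_wc          # world -> camera coordinates
    cam = cam[cam[:, 2] > 1e-6]                 # drop points behind the camera
    uvw = cam @ K.T                             # apply pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def rotate_camera(R_wc, t_wc, yaw_deg):
    """Ego-view operation (illustrative): turn the camera in place about its own vertical axis."""
    a = np.radians(yaw_deg)
    c, s = np.cos(a), np.sin(a)
    R_y = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    # The camera centre stays fixed; only the orientation changes.
    return R_y.T @ R_wc, R_y.T @ t_wc

def top_down_camera(points_world, height):
    """Global-view operation (illustrative): camera above the scene centroid, looking straight down.

    Assumes the world up direction is -y, which is an assumption of this sketch.
    """
    center = points_world.mean(axis=0) + np.array([0.0, -height, 0.0])
    # Rows are the camera's x, y, z axes expressed in world coordinates.
    R_wc = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, -1.0],
                     [0.0, 1.0, 0.0]])
    return R_wc, -R_wc @ center
```

With a focal length of 100 pixels and principal point (50, 50), a point at (1, 0, 1) seen from an identity camera projects to pixel (150, 50); after `rotate_camera(..., 45.0)` the same point lies on the optical axis and projects to (50, 50). An agent loop could render such views and feed them back to the VLM, though how Think3D actually renders and selects views is not specified in the abstract.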