Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and then to interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation reveals a clear structure: L1 skills are largely orthogonal to one another, whereas higher-level skills are strongly correlated, indicating increasing interdependence. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities, with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We therefore propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to improve performance consistently across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.