This paper explores agentic 3D spatial understanding, in which MLLM agents perform 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, so the agentic paradigm yields only marginal gains over non-agentic strategies. We find that 3D spatial reasoning tasks are heterogeneous across scenes, yet these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving, scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory in a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, and failed ones are attached to the skill as lessons. During training, whenever a similar scene recurs, the corresponding skill is injected to guide the agent; the successes and failures of the resulting trajectories further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training on skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.
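The memory-and-skill loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the names `Trajectory`, `Skill`, and `SceneMemory` are hypothetical, scenes are represented by a plain string label and matched exactly (the abstract does not say how scene similarity is computed), and "distillation" is reduced to picking the most frequent successful tool sequence.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene: str      # hypothetical scene label, assumed to be given
    tools: tuple    # ordered tool calls made by the agent
    success: bool   # whether the trajectory reached a correct answer


@dataclass
class Skill:
    scene: str
    recommended_tools: tuple = ()
    lessons: list = field(default_factory=list)  # failed tool sequences


class SceneMemory:
    """Toy scene memory: stores trajectories and re-distills a skill per scene."""

    def __init__(self):
        self._by_scene = defaultdict(list)
        self.skills = {}

    def record(self, traj):
        # Store the trajectory, then refresh the skill for its scene so that
        # memory and skill library evolve together.
        self._by_scene[traj.scene].append(traj)
        self.skills[traj.scene] = self._distill(traj.scene)

    def _distill(self, scene):
        trajs = self._by_scene[scene]
        # Assumption: the most frequent successful tool sequence stands in
        # for the distilled skill.
        successes = Counter(t.tools for t in trajs if t.success)
        best = successes.most_common(1)[0][0] if successes else ()
        # Failed tool sequences are attached to the skill as lessons.
        lessons = []
        for t in trajs:
            if not t.success and t.tools not in lessons:
                lessons.append(t.tools)
        return Skill(scene, best, lessons)

    def retrieve(self, scene):
        # Returns the skill to inject when the scene recurs, or None.
        return self.skills.get(scene)


memory = SceneMemory()
memory.record(Trajectory("indoor_room", ("depth", "detect"), True))
memory.record(Trajectory("indoor_room", ("detect",), False))
memory.record(Trajectory("indoor_room", ("depth", "detect"), True))
skill = memory.retrieve("indoor_room")
```

In this toy version, each recorded trajectory immediately updates the skill for its scene, which mirrors the co-evolution loop only in spirit; a real system would presumably summarize trajectories with a model rather than count tool sequences.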