Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning. It requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment built on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: the models possess basic view-action knowledge but fail to compose it across multi-turn plans, and the gap widens as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration thus emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are available at https://viewsuite.github.io.
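The view graph idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes viewpoints are discretized into hashable IDs and that each exploration step is recorded as a `(view, action, next_view)` triple. The function names `build_view_graph`, `shortest_plan`, and `single_step_examples`, the action strings, and the toy trajectories are all hypothetical. The sketch shows how two trajectories that each miss the target can still be merged into one graph from which a multi-step plan and single-step supervised examples are read off.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge (view, action, next_view) steps from all trajectories into one graph.

    Assumption: views are discretized to hashable IDs, so identical viewpoints
    reached in different trajectories collapse into the same node.
    """
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph


def shortest_plan(graph, start, target):
    """Breadth-first search for the shortest action sequence from start to target.

    Returns a list of actions, or None if the target is unreachable.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == target:
            return plan
        for action, next_view in graph.get(view, {}).items():
            if next_view not in seen:
                seen.add(next_view)
                queue.append((next_view, plan + [action]))
    return None


def single_step_examples(graph):
    """List one (view, next_view, action) example per edge in the graph.

    Hypothetical task format: given a view pair, predict the connecting action.
    """
    return [
        (view, next_view, action)
        for view, edges in graph.items()
        for action, next_view in edges.items()
    ]


# Two toy explorations, neither of which reaches "v4" from "v0" on its own.
trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "forward", "v2")],
    [("v2", "turn_right", "v3"), ("v3", "forward", "v4")],
]
graph = build_view_graph(trajectories)
plan = shortest_plan(graph, "v0", "v4")
examples = single_step_examples(graph)
```

In this toy case the merged graph connects `v0` to `v4` through the shared node `v2`, so a four-action plan exists even though no single trajectory contains it. That is the sense in which unsuccessful explorations can still supply supervision.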