Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning. It requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in ViewSuite, our proposed 3D point-cloud environment built on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: the models possess basic view-action knowledge but fail to compose it across multi-turn plans, and the gap widens as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration thus emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are available at https://viewsuite.github.io.
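
To make the view graph idea concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that viewpoints can be reduced to hashable identifiers (for example, discretized camera poses) and that each trajectory is a list of `(view, action, next_view)` steps. The function names `build_view_graph` and `shortest_plan`, the action strings, and the view labels are all hypothetical. The sketch merges trajectories into one directed graph and uses breadth-first search to recover a shortest action sequence between two views, which could then serve as one kind of supervised target.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge trajectories into a directed graph: view -> {action: next_view}.

    Assumption: views are hashable ids (e.g. discretized camera poses).
    Every step is kept, whether or not its trajectory reached a goal.
    """
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
            graph.setdefault(next_view, {})  # register views with no outgoing edge
    return graph


def shortest_plan(graph, start, target):
    """Return a shortest list of actions from start to target, or None."""
    if start == target:
        return []
    parent = {start: None}  # view -> (previous view, action taken), None at the root
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for action, next_view in graph.get(view, {}).items():
            if next_view in parent:
                continue
            parent[next_view] = (view, action)
            if next_view == target:
                # Walk back to the start, collecting actions in reverse order.
                plan, node = [], next_view
                while parent[node] is not None:
                    node, step = parent[node]
                    plan.append(step)
                return plan[::-1]
            queue.append(next_view)
    return None


# Two short hypothetical trajectories; neither one connects A to D by itself.
trajectories = [
    [("A", "forward", "B"), ("B", "turn_left", "C")],
    [("C", "forward", "D")],
]
graph = build_view_graph(trajectories)
print(shortest_plan(graph, "A", "D"))  # ['forward', 'turn_left', 'forward']
```

In this toy example the plan from `A` to `D` appears in neither trajectory alone; it exists only in the merged graph. This illustrates, under the assumptions above, how trajectories that individually fail can still contribute usable supervision.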