Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when they are mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics, while preserving semantic alignment with the video-generated plan. Empirically, across simulated navigation and manipulation tasks, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video generations and motion-blurred videos that violate physical constraints.
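The grounding step described above can be sketched as a small numerical example. This is a toy stand-in, not the paper's method: it assumes a hypothetical linear latent world model z' = A z + B a and quadratic penalties, and jointly optimizes latent states and actions by gradient descent so the trajectory (i) satisfies the model dynamics, (ii) stays close to the video-plan latents, and (iii) reaches the goal latent. All variable names and weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A_DIM, T = 4, 2, 8                      # latent dim, action dim, horizon (toy sizes)
A = 0.95 * np.eye(D)                       # hypothetical linear latent dynamics
B = 0.1 * rng.normal(size=(D, A_DIM))      # hypothetical action-effect matrix

z_video = rng.normal(size=(T + 1, D))      # stand-in for latents encoded from the video plan
z0, z_goal = z_video[0], z_video[-1]

# Dynamics residual of the raw video plan (zero actions): how infeasible it is
init_residual = np.linalg.norm(z_video[1:] - z_video[:-1] @ A.T)

# Decision variables: latent states z_1..z_T and actions a_0..a_{T-1}
z = z_video[1:].copy()
a = np.zeros((T, A_DIM))

lam_dyn, lam_vid, lam_goal, lr = 10.0, 1.0, 5.0, 0.01
for _ in range(500):
    z_full = np.vstack([z0[None], z])
    # Collocation residuals r_t = z_{t+1} - (A z_t + B a_t)
    r = z_full[1:] - (z_full[:-1] @ A.T + a @ B.T)
    # Gradients of: lam_dyn*sum||r_t||^2 + lam_vid*sum||z_t - z_video_t||^2
    #             + lam_goal*||z_T - z_goal||^2
    g_z = 2 * lam_dyn * r + 2 * lam_vid * (z - z_video[1:])
    g_z[:-1] -= 2 * lam_dyn * (r[1:] @ A)  # each z_t also enters the next residual via -A z_t
    g_z[-1] += 2 * lam_goal * (z[-1] - z_goal)
    g_a = -2 * lam_dyn * (r @ B)
    z -= lr * g_z
    a -= lr * g_a

z_full = np.vstack([z0[None], z])
feasibility = np.linalg.norm(z_full[1:] - (z_full[:-1] @ A.T + a @ B.T))
# feasibility is much smaller than init_residual: the optimized trajectory
# is dynamics-consistent while staying near the video-plan latents
```

In this quadratic toy problem the soft dynamics penalty alone shrinks the residual substantially; the actual method replaces the linear model with a learned action-conditioned world model, so the optimization is nonconvex and solved in its learned latent space.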