Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning with verifiable rewards (RLVR) in language models, RLWG can combine multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
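To make the GRPO-style alignment concrete, the sketch below shows the two ingredients the abstract names: a verifiable geometric reward (here, a hypothetical pose cycle-consistency score) and group-relative advantage normalization, in which each rollout's reward is standardized against the other rollouts in its group. This is a minimal illustration under assumed interfaces, not the paper's implementation; function names and the 4x4 pose representation are illustrative.

```python
import numpy as np

def pose_cycle_consistency_reward(pose_forward, pose_backward):
    """Hypothetical verifiable reward: composing a forward camera pose with the
    estimated backward pose should return to identity. The reward is the
    negative Frobenius deviation from identity (0 = perfectly cycle-consistent,
    more negative = larger geometric drift)."""
    cycle = pose_forward @ pose_backward
    return -np.linalg.norm(cycle - np.eye(4))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's reward by the mean and
    std of its sampled group, so updates depend on relative quality within the
    group rather than absolute reward scale."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of rollouts scored by the geometric reward, then normalized.
T = np.eye(4)
T[:3, 3] = [1.0, 0.0, 0.0]           # forward pose: translate 1m along x
perfect = pose_cycle_consistency_reward(T, np.linalg.inv(T))  # exact inverse
drifted = pose_cycle_consistency_reward(T, np.eye(4))         # no return motion
advantages = group_relative_advantages([perfect, drifted])
```

In a full pipeline, each advantage would weight the policy-gradient update for its rollout, so generations whose geometry verifies well within the group are reinforced.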