VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or by applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods rely on RGB-space rewards that require repeated VAE decoding, which incurs substantial compute overhead, and are limited to static scenes, so they fail to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing the LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitation of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization (GRPO) with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement learning an efficient and flexible approach to world-consistent video generation.
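
To make the reward-to-advantage step concrete, the following is a minimal sketch, not the formulation used in VGGRPO. It assumes that camera centers have already been decoded for each sampled video (in VGGRPO this role would be played by the LGM), scores smoothness with a hypothetical second-difference penalty, and applies the standard GRPO normalization of rewards within a group of samples. The function names `camera_smoothness_reward` and `group_relative_advantages` are illustrative, and the geometry reprojection reward is omitted.

```python
from statistics import fmean, pstdev


def camera_smoothness_reward(positions):
    """Hypothetical smoothness reward (an assumption, not the paper's formula).

    positions: list of (x, y, z) camera centers, one per frame.
    Returns the negative mean squared second difference (a discrete
    acceleration), so a constant-velocity path scores 0 and jitter scores < 0.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for p0, p1, p2 in zip(positions, positions[1:], positions[2:]):
        total += sum((a - 2.0 * b + c) ** 2 for a, b, c in zip(p0, p1, p2))
    return -total / (len(positions) - 2)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one group.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two trajectories sampled for the same prompt.
smooth = [(float(t), 0.0, 0.0) for t in range(5)]
jitter = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, -0.5, 0.0),
          (3.0, 0.5, 0.0), (4.0, 0.0, 0.0)]
rewards = [camera_smoothness_reward(smooth), camera_smoothness_reward(jitter)]
advantages = group_relative_advantages(rewards)
```

In this toy group the smooth trajectory receives a positive advantage and the jittery one a negative advantage, which is the signal a policy-gradient update would use to favor stable camera motion. A full system would combine this term with the reprojection consistency reward before normalization.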
