Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. Existing methods attempt to inject 3D priors through architectural modifications, which often incurs high computational cost and limits scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized text-only dataset tailored for world simulation. Using Flow-GRPO, we optimize the model with feedback from pre-trained 3D foundation models and vision-language models, enforcing structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations show that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
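To make the reward-driven optimization more concrete, the following is a minimal sketch of how per-sample rewards could be turned into group-relative advantages, the normalization step that GRPO-style methods are built on. It is not the World-R1 implementation. The function names, the weight `w_geo`, the `period` parameter, and the on/off schedule for the geometry reward are all hypothetical, and the schedule is only one possible reading of "periodic decoupled training".

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples generated for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples are
    scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_reward(geometry_score, vlm_score, use_geometry, w_geo=0.5):
    """Hypothetical mix of a 3D-consistency score and a VLM score.

    The weighting scheme is an assumption made for illustration; the abstract
    does not specify how the two feedback signals are combined.
    """
    if not use_geometry:
        return vlm_score
    return w_geo * geometry_score + (1.0 - w_geo) * vlm_score


def geometry_active(step, period=4):
    """Hypothetical periodic schedule: drop the geometry reward on every
    `period`-th step so that dynamic content is not over-constrained."""
    return step % period != period - 1


# Usage: score a group of four samples at one training step.
geometry_scores = [0.9, 0.4, 0.7, 0.2]  # placeholder values, not measured results
vlm_scores = [0.6, 0.8, 0.5, 0.7]       # placeholder values, not measured results
step = 0
rewards = [
    combined_reward(g, v, geometry_active(step))
    for g, v in zip(geometry_scores, vlm_scores)
]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, a sample is reinforced only when it scores better than the other samples drawn for the same prompt. This relative scoring is what allows the feedback signal to come from external pre-trained models without a separately learned value function.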