This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive, video-based world models that enables them to explore the world more accurately and consistently in response to interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: we generate and evaluate multiple samples for a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: we design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: we employ a negative-aware fine-tuning strategy coupled with various efficiency optimizations to enhance model capability efficiently and effectively. Evaluations on the state-of-the-art open-source world model WorldPlay demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across diverse scenarios.
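The clip-level rollout and reward scheme described above can be sketched as follows. This is a minimal illustrative sketch, not WorldCompass's actual implementation: the clip generator is a random stand-in for the autoregressive video model, and all function names, reward definitions, and weights (`generate_clip`, `complementary_reward`, `w_action`, `w_quality`) are hypothetical.

```python
import random

def generate_clip(context, action, seed):
    # Stand-in for the autoregressive video model: returns a fake "clip"
    # with per-clip scores in [0, 1). Real models would produce video frames.
    random.seed(hash((context, action, seed)) % (2**32))
    return {"follows_action": random.random(), "visual_quality": random.random()}

def complementary_reward(clip, w_action=0.5, w_quality=0.5):
    # Combine interaction-following accuracy and visual quality so that
    # optimizing one term alone cannot dominate (reward-hacking suppression).
    return w_action * clip["follows_action"] + w_quality * clip["visual_quality"]

def clip_level_rollout(context, action, num_samples=4):
    # Sample several candidate clips for ONE target clip and score each,
    # yielding fine-grained, per-clip reward signals.
    clips = [generate_clip(context, action, seed=s) for s in range(num_samples)]
    rewards = [complementary_reward(c) for c in clips]
    return clips, rewards

def negative_aware_weights(rewards):
    # Positive advantage for above-average samples, negative for below-average
    # ones, mirroring the spirit of a negative-aware fine-tuning update.
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

clips, rewards = clip_level_rollout(context="hallway", action="turn_left")
weights = negative_aware_weights(rewards)
```

Because all samples share one target clip, the group-relative weights above give each clip a direct, comparable learning signal without rolling out a full long-horizon trajectory.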