Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level to provide a dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
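To make the reward design concrete, the clip-level aggregation of multi-view fidelity metrics can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the linear weighting, and the per-frame metric callables (stand-ins for a learned perceptual distance such as LPIPS and for SSIM) are all hypothetical.

```python
import numpy as np

def clip_reward(pred_clip, gt_clip, w_perc=0.5, w_ssim=0.5,
                perc_fn=None, ssim_fn=None):
    """Aggregate per-frame, per-view fidelity scores into one clip-level reward.

    pred_clip, gt_clip: arrays of shape (views, frames, H, W, C) in [0, 1].
    perc_fn: per-frame perceptual distance (lower is better), e.g. LPIPS-like.
    ssim_fn: per-frame structural similarity (higher is better).
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    views, frames = pred_clip.shape[:2]
    scores = []
    for v in range(views):          # combine complementary metrics per view
        for t in range(frames):
            d = perc_fn(pred_clip[v, t], gt_clip[v, t])
            s = ssim_fn(pred_clip[v, t], gt_clip[v, t])
            # flip the distance so that higher always means better
            scores.append(w_perc * (1.0 - d) + w_ssim * s)
    # averaging over all frames and views yields a single dense,
    # low-variance scalar reward for the whole clip
    return float(np.mean(scores))
```

Averaging across every frame and camera view, rather than scoring only the final frame, is what keeps the reward dense and low-variance for RL training.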