Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than make genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics: it improves rollout stability through a controllable action-conditioned video world model, reduces effective error depth by reshaping imagined interaction via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
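To make the rollout-reshaping idea concrete, below is a minimal Python sketch of keyframe-initialized rollouts under assumed interfaces; it is illustrative only and not WoVR's actual implementation. The names `ToyWorldModel`, `toy_policy`, `keyframe_initialized_rollouts`, and `segment_len` are all hypothetical. The point it demonstrates: by restarting imagination from ground-truth keyframe states every `segment_len` steps, world-model error can only accumulate within a short segment rather than over the full task horizon.

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a learned action-conditioned dynamics model (assumption:
    a real system would use a video world model, not this toy linear map)."""
    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Imagined transition with small noise standing in for model error.
        return state + 0.1 * action + 0.01 * np.random.randn(*state.shape)

def toy_policy(state: np.ndarray) -> np.ndarray:
    """Stand-in for a VLA policy mapping states to actions."""
    return -state  # drive the state toward zero

def keyframe_initialized_rollouts(world_model, policy, keyframes, segment_len=10):
    """Collect imagined transitions, re-anchoring each segment on a real keyframe.

    Effective error depth is bounded by `segment_len` instead of the full
    horizon, since every segment restarts from a ground-truth state.
    """
    transitions = []
    for keyframe in keyframes:
        state = keyframe.copy()  # reset imagination to a real observation
        for _ in range(segment_len):
            action = policy(state)
            next_state = world_model.step(state, action)  # imagined step
            transitions.append((state, action, next_state))
            state = next_state
    return transitions

if __name__ == "__main__":
    # Hypothetical keyframe states, e.g. sampled from real demonstrations.
    keyframes = [np.ones(4) * i for i in range(1, 4)]
    rollout = keyframe_initialized_rollouts(ToyWorldModel(), toy_policy, keyframes)
    print(f"collected {len(rollout)} imagined transitions")
```

In this sketch the collected transitions would feed a standard RL update; the design choice being illustrated is that shortening the closed-loop imagination window trades rollout length for fidelity, which is the error-depth reduction the abstract attributes to Keyframe-Initialized Rollouts.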