Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than make genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics: it improves rollout stability through a controllable action-conditioned video world model, reduces effective error depth by reshaping imagined interaction via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
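To make the rollout-reshaping idea concrete, below is a minimal Python sketch of keyframe-initialized rollouts under assumed interfaces; it is illustrative only and not WoVR's actual implementation. The names `ToyWorldModel`, `toy_policy`, `keyframe_initialized_rollouts`, and `segment_len` are all hypothetical. The point it demonstrates: by restarting imagination from ground-truth keyframe states every `segment_len` steps, world-model error can only accumulate within a short segment rather than over the full task horizon.

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a learned action-conditioned dynamics model (assumption:
    a real system would use a video world model, not this toy linear map)."""
    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Imagined transition with small noise standing in for model error.
        return state + 0.1 * action + 0.01 * np.random.randn(*state.shape)

def toy_policy(state: np.ndarray) -> np.ndarray:
    """Stand-in for a VLA policy mapping states to actions."""
    return -state  # drive the state toward zero

def keyframe_initialized_rollouts(world_model, policy, keyframes, segment_len=10):
    """Collect imagined transitions, re-anchoring each segment on a real keyframe.

    Effective error depth is bounded by `segment_len` instead of the full
    horizon, since every segment restarts from a ground-truth state.
    """
    transitions = []
    for keyframe in keyframes:
        state = keyframe.copy()  # reset imagination to a real observation
        for _ in range(segment_len):
            action = policy(state)
            next_state = world_model.step(state, action)  # imagined step
            transitions.append((state, action, next_state))
            state = next_state
    return transitions

if __name__ == "__main__":
    # Hypothetical keyframe states, e.g. sampled from real demonstrations.
    keyframes = [np.ones(4) * i for i in range(1, 4)]
    rollout = keyframe_initialized_rollouts(ToyWorldModel(), toy_policy, keyframes)
    print(f"collected {len(rollout)} imagined transitions")
```

In this sketch the collected transitions would feed a standard RL update; the design choice being illustrated is that shortening the closed-loop imagination window trades rollout length for fidelity, which is the error-depth reduction the abstract attributes to Keyframe-Initialized Rollouts.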