Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables closed-loop reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model's precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics. Project page: https://showlab.github.io/World-VLA-Loop/.