Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/junjxiao/world-env.