The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator, specifically an action-conditioned video generation model, can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that cover few distinct physical interactions (particularly failure cases), and they struggle to accurately model the small yet critical physical details of contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, generate supplemental synthetic data for improving the VLA model. In experiments on a real robot, this approach improves the performance of a state-of-the-art VLA model on multiple downstream tasks: a 39.2% absolute success-rate improvement over the base policy, of which 11.6% comes from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w
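The alternation described above (real rollouts refine the world model, which then supplies synthetic rollouts for policy training) can be sketched as a simple loop. This is a minimal illustrative sketch, not the paper's actual implementation: every function here is a hypothetical placeholder standing in for expensive real-robot, world-model, and VLA training steps, with toy counters in place of real models.

```python
# Hypothetical sketch of the iterative improvement loop. All functions are
# illustrative placeholders: in the real system, rollouts come from a robot,
# the world model is an action-conditioned video generator, and the policy
# is a VLA model. Here integers stand in for model "states" so the loop runs.

def collect_real_rollouts(policy, n=4):
    # Placeholder: execute the VLA policy on the real robot (expensive).
    return [("real", policy, i) for i in range(n)]

def finetune_world_model(world_model, rollouts):
    # Placeholder: fine-tune the video world model on real rollout data,
    # including failure cases, to improve its physical fidelity.
    return world_model + len(rollouts)

def generate_synthetic_rollouts(world_model, policy, n=16):
    # Placeholder: roll the policy out inside the learned simulator (cheap).
    return [("synthetic", world_model, i) for i in range(n)]

def finetune_policy(policy, rollouts):
    # Placeholder: train the VLA policy on mixed real + synthetic rollouts.
    return policy + len(rollouts)

def iterative_improvement(policy=0, world_model=0, iterations=2):
    for _ in range(iterations):
        real = collect_real_rollouts(policy)
        world_model = finetune_world_model(world_model, real)
        synthetic = generate_synthetic_rollouts(world_model, policy)
        policy = finetune_policy(policy, real + synthetic)
    return policy, world_model

policy, world_model = iterative_improvement()
```

The key design point is that the expensive step (real rollouts) serves double duty: it grounds the world model's physics and contributes directly to policy training, while the bulk of the training data is generated cheaply in simulation.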