The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator, specifically an action-conditioned video generation model, can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that do not cover many kinds of physical interaction (particularly failure cases), and they struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state-of-the-art VLA model on multiple downstream tasks, achieving a 39.2% absolute success rate improvement over the base policy, of which 11.6% comes from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w