Robot learning from interaction with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software simulator, are limited by the amount of expert data available and by the sim-to-real gap in manipulation, respectively. With the recent emergence of world models learned from real-world video-action data, we ask whether training a policy in a world model can achieve better real-robot performance than supervised learning or software simulation. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and a software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes generated by the world model, test-time training in a novel scene, and online iterative improvement of the world model and policy. Our results suggest that learning a world model and training robot policies in the cloud could be key to bridging the gap between robots that work in demonstrations and robots that work in anyone's household.