RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich

Source: arXiv. Website: https://robotics-transformer.github.io/

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action models (VLAs) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. These include significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
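The idea of expressing actions as text tokens can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything below is an assumption made for illustration: each continuous action dimension is clipped to a range, uniformly discretized into a fixed number of bins (256 here), and the bin indices are written out as a space-separated string that a language model could emit like any other text. The names `action_to_text`, `text_to_action`, and `NUM_BINS` are hypothetical and do not come from the RT-2 release.

```python
from typing import List, Sequence

# Assumed number of bins per action dimension; the abstract does not state it.
NUM_BINS = 256


def action_to_text(
    action: Sequence[float],
    low: float = -1.0,
    high: float = 1.0,
    num_bins: int = NUM_BINS,
) -> str:
    """Encode a continuous action vector as a string of integer bin indices."""
    tokens = []
    for value in action:
        # Clip to the assumed valid range, then map onto [0, num_bins - 1].
        clipped = min(max(value, low), high)
        bin_index = int(round((clipped - low) / (high - low) * (num_bins - 1)))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def text_to_action(
    text: str,
    low: float = -1.0,
    high: float = 1.0,
    num_bins: int = NUM_BINS,
) -> List[float]:
    """Decode a string of bin indices back into approximate continuous values."""
    return [
        low + int(token) / (num_bins - 1) * (high - low)
        for token in text.split()
    ]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, e.g. an end-effector displacement.
    encoded = action_to_text([0.3, -0.8, 1.0])
    print(encoded)
    print(text_to_action(encoded))
```

Under this scheme a robot trajectory step becomes an (image, instruction, action string) example with the same shape as a visual question answering example, which is what allows both kinds of data to be mixed in one training set. Decoding is lossy: the recovered value can differ from the original by up to half a bin width.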

