We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model both to learn to map robot observations to actions and to enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action (VLA) models and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. These include significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
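To illustrate what expressing actions as text tokens could look like, here is a minimal sketch. The abstract does not specify the encoding, so everything concrete below is an assumption made for illustration: the 256 uniform bins, the normalized action range of [-1, 1], the space-separated integer format, and the hypothetical helper names `encode_action` and `decode_action`.

```python
from typing import List, Sequence

# Assumption: 256 uniform bins per action dimension (not stated in the abstract).
NUM_BINS = 256


def encode_action(action: Sequence[float], low: float = -1.0,
                  high: float = 1.0, num_bins: int = NUM_BINS) -> str:
    """Discretize each action dimension and render the result as plain text.

    The returned string can be placed in a training target the same way a
    natural language answer would be.
    """
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep value inside [low, high]
        fraction = (clipped - low) / (high - low)  # rescale to [0, 1]
        bin_index = min(int(fraction * num_bins), num_bins - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0,
                  num_bins: int = NUM_BINS) -> List[float]:
    """Map a string of bin indices back to continuous values (bin centers)."""
    width = (high - low) / num_bins
    return [low + (int(token) + 0.5) * width for token in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, already normalized to [-1, 1].
    text = encode_action([-1.0, 0.0, 1.0])
    print(text)                 # prints "0 128 255"
    print(decode_action(text))  # bin centers, within half a bin of the input
```

Under these assumptions, a robot training example pairs an image and an instruction with a target string such as `"0 128 255"`, while a visual question answering example pairs an image and a question with an ordinary text answer. Both targets are token sequences, which is what would let a single model be co-fine-tuned on both kinds of data. Decoding recovers each action dimension only up to quantization error, at most half a bin width.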