Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policies as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such data can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.