LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and respond with policy decisions in text. We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved action outputs when trained with auxiliary data that complements policy learning. We first introduce an automated pipeline to generate conversation-style instruction tuning data from existing behavior cloning data. Then we enrich the dataset in a self-supervised fashion by formulating six auxiliary tasks. A VLM finetuned with the resulting collection of datasets can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
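The pipeline that converts behavior cloning data into conversation-style instruction tuning data can be sketched as follows. This is a minimal illustrative sketch, not the actual LLaRA implementation: the record fields (`instruction`, `pick_xy`, `place_xy`, `image_path`), the `make_sample` helper, and the conversation schema are all assumptions made for the example.

```python
# Sketch: wrap one behavior-cloning step as a single-turn conversation,
# in the spirit of the automated pipeline described above.
# All names and field layouts here are illustrative assumptions,
# not the actual LLaRA data schema.

def make_sample(record):
    """Turn a (image, instruction, action) record into a chat-style sample."""
    prompt = (
        f"<image>\nThe task is: {record['instruction']}. "
        "What action should the robot take next?"
    )
    # Encode the continuous action as text, e.g. normalized 2D pick/place points.
    px, py = record["pick_xy"]
    qx, qy = record["place_xy"]
    answer = f"Pick at ({px:.3f}, {py:.3f}), then place at ({qx:.3f}, {qy:.3f})."
    return {
        "image": record["image_path"],
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }

# Hypothetical behavior-cloning record for illustration.
record = {
    "image_path": "obs_0001.png",
    "instruction": "put the red block in the bowl",
    "pick_xy": (0.312, 0.744),
    "place_xy": (0.551, 0.208),
}
sample = make_sample(record)
```

Auxiliary tasks (e.g., predicting object locations or describing the scene) can then be generated from the same records in a self-supervised fashion and mixed into the dataset alongside these policy samples.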