In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet how to leverage these models in robotics remains an open question. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize across different settings has often fallen short. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as "visual traces", can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as on a physical Franka Emika Panda 7-DoF robot. Our experiments demonstrate that LLARVA, using 2-D and language representations, performs well compared to several contemporary baselines and can generalize across various robot environments and configurations.