Prompt-based learning has been demonstrated to be a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success on language tasks, existing research has applied LLMs to embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts that interleave vision signals with text descriptions. This type of task poses a major challenge to a robot's ability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy for robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining followed by multi-task finetuning. To facilitate multimodal understanding, we design a multimodal prompt encoder that augments a pretrained LM with a residual connection to the visual input, and we model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on VIMA-BENCH and establish a new state of the art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits a remarkable in-context learning ability. Project page: \url{https://midas-icml.github.io/}.
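To make the encoder design concrete, the sketch below illustrates one plausible reading of "augmenting a pretrained LM with a residual connection to the visual input": visual features are projected into the LM's embedding space, interleaved with text tokens, and added back to the encoder output so visual information can bypass the text-pretrained weights. The module names, dimensions, and the generic Transformer standing in for the pretrained LM are all assumptions for illustration, not the paper's exact implementation.

\begin{verbatim}
# Hypothetical sketch of a multimodal prompt encoder with a visual
# residual connection; all names and dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalPromptEncoder(nn.Module):
    def __init__(self, d_model=768, n_layers=6, n_heads=12, d_visual=512):
        super().__init__()
        # Project visual features (e.g., object crops) into the LM
        # embedding space.
        self.visual_proj = nn.Linear(d_visual, d_model)
        # Generic Transformer encoder standing in for a pretrained LM
        # (e.g., a T5 encoder in practice).
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.lm = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_emb, visual_feat, visual_mask):
        # text_emb:    (B, T, d_model) LM token embeddings
        # visual_feat: (B, T, d_visual) visual features, zero at
        #              text positions
        # visual_mask: (B, T) bool, True at visual-token positions
        vis_emb = self.visual_proj(visual_feat)
        # Interleave: visual embeddings occupy the visual slots of
        # the prompt sequence.
        tokens = torch.where(visual_mask.unsqueeze(-1), vis_emb,
                             text_emb)
        out = self.lm(tokens)
        # Residual connection from the visual input to the encoder
        # output, so visual signals are not attenuated by the
        # text-pretrained LM weights.
        return out + vis_emb * visual_mask.unsqueeze(-1)
\end{verbatim}

The residual path reflects the intuition stated above: a frozen or lightly finetuned text-pretrained LM may underweight out-of-distribution visual tokens, and the skip connection guarantees them a direct route to the downstream policy.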