Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often treated as different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/
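To make the idea of a multimodal prompt concrete, the following is a minimal sketch of how a task description might be turned into one sequence that interleaves textual and visual tokens. It is an illustration only, not VIMA's actual tokenizer: the `{placeholder}` template syntax, the names `parse_prompt` and `to_tokens`, and the `("text", ...)` / `("image", ...)` token tuples are all hypothetical, and a real system would replace the image handles with embeddings from a visual encoder.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TextSegment:
    text: str


@dataclass(frozen=True)
class ImageSegment:
    name: str  # hypothetical handle standing in for an image or object crop


Segment = Union[TextSegment, ImageSegment]


def parse_prompt(template: str) -> List[Segment]:
    """Split a template such as 'Put the {obj} into the {container}.'
    into alternating text and image segments (assumed placeholder syntax)."""
    parts = re.split(r"\{(\w+)\}", template)
    segments: List[Segment] = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd positions hold the captured placeholder names.
            segments.append(ImageSegment(part))
        elif part.strip():
            segments.append(TextSegment(part))
    return segments


def to_tokens(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Flatten segments into one interleaved token sequence.
    Text is split on whitespace; each image contributes a single token."""
    tokens: List[Tuple[str, str]] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            tokens.extend(("text", word) for word in seg.text.split())
        else:
            tokens.append(("image", seg.name))
    return tokens


if __name__ == "__main__":
    prompt = "Put the {obj} into the {container}."
    for token in to_tokens(parse_prompt(prompt)):
        print(token)
```

Under these assumptions, a language instruction, a visual goal, and a one-shot demonstration would differ only in which segments appear in the sequence, so a single sequence model could consume all of them.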