Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and box assembly.