The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities with large-scale evaluations, both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.