The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100 to 1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities with large-scale evaluations, both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.