A rich representation is key to general robotic manipulation, but existing model architectures require a lot of data to learn it. Unfortunately, ideal robotic manipulation training data, which comes in the form of expert visuomotor demonstrations for a variety of annotated tasks, is scarce. In this work we propose PLEX, a transformer-based architecture that learns from task-agnostic visuomotor trajectories accompanied by a much larger amount of task-conditioned object manipulation videos, a type of robotics-relevant data available in quantity. The key insight behind PLEX is that trajectories with observations and actions help induce a latent feature space and train a robot to execute task-agnostic manipulation routines, while a diverse set of video-only demonstrations can efficiently teach the robot how to plan in this feature space for a wide variety of tasks. In contrast to most works on robotic manipulation pretraining, PLEX learns a generalizable sensorimotor multi-task policy, not just an observational representation. We also show that using relative positional encoding in PLEX's transformers further increases its data efficiency when learning from human-collected demonstrations. Experiments showcase PLEX's generalization on the Meta-World-v2 benchmark and establish state-of-the-art performance in challenging Robosuite environments.