A rich representation is key to general robotic manipulation, but existing approaches to representation learning require large amounts of multimodal demonstrations. In this work, we propose PLEX, a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories and a much larger amount of task-conditioned object manipulation videos, a type of data that is available in quantity. PLEX uses the visuomotor trajectories to induce a latent feature space and to learn task-agnostic manipulation routines, while diverse video-only demonstrations teach PLEX how to plan in the induced latent feature space for a wide variety of tasks. Experiments showcase PLEX's generalization on Meta-World and its state-of-the-art performance in challenging Robosuite environments. In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from human-collected demonstrations. The paper's accompanying code and data are available at https://microsoft.github.io/PLEX.