There have recently been large advances both in pre-training visual representations for robotic control and in segmenting unknown-category objects in general images. To leverage these for improved robot learning, we propose $\textbf{POCR}$, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate the various entities in the scene across timesteps, capturing "where" information. To each such segmented entity, we apply other pre-trained models that produce vector descriptions suitable for robotic control tasks, capturing "what" the entity is. Our pre-trained object-centric representations for control are thus constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations, which are typically trained from scratch.
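To make the "what-where" composition concrete, below is a minimal sketch of how such a representation could be assembled from off-the-shelf models. The callables `segment` (e.g., a promptable segmenter such as SAM) and `encode` (e.g., a control-oriented pre-trained encoder such as R3M), the slot layout, and the `where_vector` descriptor are all illustrative assumptions, not the paper's exact interface; cross-timestep association of slots, which POCR handles to keep entity locations stable, is omitted here for brevity.

```python
# Sketch of a POCR-style "what-where" object-centric representation,
# assuming two hypothetical off-the-shelf models:
#   segment(image) -> list of HxW binary masks  ("where" model)
#   encode(crop)   -> 1D float feature vector   ("what" model)
import numpy as np

def where_vector(mask: np.ndarray) -> np.ndarray:
    """Coarse 'where' descriptor: mask centroid and box size, normalized to [0, 1]."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    cy, cx = ys.mean() / h, xs.mean() / w
    bh = (ys.max() - ys.min() + 1) / h
    bw = (xs.max() - xs.min() + 1) / w
    return np.array([cx, cy, bw, bh], dtype=np.float32)

def pocr_representation(image: np.ndarray, segment, encode,
                        num_slots: int = 8) -> np.ndarray:
    """Build one slot per segmented entity: [what-features ; where-features]."""
    masks = segment(image)[:num_slots]      # "where": locate entities in the scene
    assert masks, "expect at least one segmented entity"
    slots = []
    for mask in masks:
        crop = image * mask[..., None]      # isolate the entity's pixels
        what = encode(crop)                 # "what": describe the isolated entity
        slots.append(np.concatenate([what, where_vector(mask)]))
    # Pad with zero slots so the policy input has a fixed shape.
    slot_dim = slots[0].shape[0]
    while len(slots) < num_slots:
        slots.append(np.zeros(slot_dim, dtype=np.float32))
    return np.stack(slots)                  # (num_slots, what_dim + 4)
```

Under these assumptions, the stacked slots (flattened, or consumed by a set-based architecture) would serve as the observation for the downstream imitation policy, with no gradient ever flowing into the frozen pre-trained models.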