Generic, reusable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers (SOFT), a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than constructing representations from only the final-layer activations, SOFT individuates and locates object-like entities using PVT attentions and describes them with PVT activations, producing an object-centric embedding. For each of several standard generic PVTs, we demonstrate that policies trained on SOFT(PVT) far outperform those trained on the standard PVT representations for manipulation tasks in simulated and real settings, approaching state-of-the-art robotics-aware representations. Code, appendix and videos: https://sites.google.com/view/robot-soft/
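To make the idea of an object-centric embedding built from attentions and activations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a patch-to-patch attention matrix `attn` and per-patch activations `feats` have already been extracted from some PVT. The function name `object_slots`, its parameters, and the use of k-means as the grouping step are all hypothetical stand-ins chosen for illustration; the abstract does not specify how SOFT groups patches.

```python
import numpy as np

def object_slots(attn, feats, grid_hw, num_slots=4, iters=10, seed=0):
    """Illustrative only: group patches by attention affinity, then describe
    each group by its mean activation and its normalized centroid location.

    attn:    (N, N) patch-to-patch attention (assumed already extracted)
    feats:   (N, D) per-patch activations (assumed already extracted)
    grid_hw: (H, W) patch grid with H * W == N
    Returns a (num_slots, D + 2) array: [mean feature, row, col] per slot.
    """
    h, w = grid_hw
    n = attn.shape[0]
    assert n == h * w and feats.shape[0] == n

    # Symmetrize attention so it behaves like a patch affinity matrix.
    affinity = 0.5 * (attn + attn.T)
    # Each patch is described by its unit-norm affinity profile.
    norms = np.linalg.norm(affinity, axis=1, keepdims=True) + 1e-8
    profiles = affinity / norms

    # Plain k-means on affinity profiles (a stand-in grouping step).
    rng = np.random.default_rng(seed)
    centers = profiles[rng.choice(n, size=num_slots, replace=False)]
    for _ in range(iters):
        dists = ((profiles[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(num_slots):
            members = profiles[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((profiles[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)

    # Patch coordinates normalized to [0, 1] (row-major patch order assumed).
    rows, cols = np.divmod(np.arange(n), w)
    coords = np.stack([rows / max(h - 1, 1), cols / max(w - 1, 1)], axis=1)

    # One slot per group: mean activation plus centroid; empty groups stay zero.
    slots = np.zeros((num_slots, feats.shape[1] + 2))
    for k in range(num_slots):
        mask = labels == k
        if mask.any():
            slots[k, :-2] = feats[mask].mean(axis=0)
            slots[k, -2:] = coords[mask].mean(axis=0)
    return slots
```

In this sketch, a downstream policy would consume the flattened `slots` array in place of a single final-layer feature vector. For a 4x4 patch grid with 8-dimensional activations, `object_slots(attn, feats, (4, 4), num_slots=3)` returns an array of shape `(3, 10)`.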