In this paper, we use self-supervised vision transformer models and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce BC-ViT, an imitation learning algorithm that uses patch-level embeddings from a DINO pre-trained Vision Transformer (ViT) to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We show that this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. Our method, data, and evaluation approach are made available to facilitate further study of generalization in imitation learners.
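The idea of clustering patch-level features into concepts and reading off keypoints can be illustrated with a minimal sketch. This is not the BC-ViT implementation: the abstract says only that appearance features are clustered, so the use of k-means, the choice of a cluster's mean patch location as its keypoint, and the names `kmeans` and `cluster_keypoints` are all assumptions made for illustration. Random vectors stand in for DINO ViT patch embeddings.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed clustering choice).

    features: (N, D) array. Returns (centroids (k, D), labels (N,)).
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    init = rng.choice(len(features), size=k, replace=False)
    centroids = features[init].astype(float).copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (N, k).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def cluster_keypoints(patch_features, grid_hw, k):
    """Cluster patch embeddings and return one (row, col) keypoint per cluster.

    patch_features: (H*W, D) array of patch embeddings in row-major grid order.
    grid_hw: (H, W) size of the patch grid.
    Returns (keypoints (k, 2), label_map (H, W)); an empty cluster yields NaN.
    """
    h, w = grid_hw
    _, labels = kmeans(patch_features, k)
    rows, cols = np.divmod(np.arange(h * w), w)
    coords = np.stack([rows, cols], axis=1).astype(float)
    keypoints = np.full((k, 2), np.nan)
    for j in range(k):
        mask = labels == j
        if mask.any():
            # Assumed keypoint definition: mean grid location of the cluster.
            keypoints[j] = coords[mask].mean(axis=0)
    return keypoints, labels.reshape(h, w)


# Stand-in for real embeddings: a 14x14 patch grid with 384-dim features.
demo_features = np.random.default_rng(0).normal(size=(14 * 14, 384))
demo_keypoints, demo_label_map = cluster_keypoints(demo_features, (14, 14), k=4)
```

In a real pipeline, `demo_features` would be replaced by patch embeddings extracted from a pre-trained DINO ViT, and the resulting keypoints would presumably feed a downstream behaviour cloning policy; how BC-ViT does either is not specified in the abstract.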