We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
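The two stages described above, image-level retrieval followed by pixel-level matching, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each image has already been encoded into one global embedding and a set of per-patch descriptors, and random arrays stand in for the DINO features. Cosine similarity for retrieval and mutual nearest neighbours for correspondence are assumed choices that the abstract does not specify, and the names `retrieve_demo` and `mutual_nearest_neighbours` are hypothetical.

```python
import numpy as np


def l2_normalise(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_demo(query_global, demo_globals):
    """Image-level step: return the index of the most similar demonstration.

    query_global: (D,) embedding of the live image (assumed precomputed).
    demo_globals: (N, D) embeddings of the demonstration images.
    """
    sims = l2_normalise(demo_globals) @ l2_normalise(query_global)
    return int(np.argmax(sims)), sims


def mutual_nearest_neighbours(patches_a, patches_b):
    """Pixel-level step: keep pairs (i, j) that are each other's best match.

    patches_a: (Na, D) patch descriptors of image A.
    patches_b: (Nb, D) patch descriptors of image B.
    Returns an array of shape (K, 2) holding the matched indices.
    """
    sim = l2_normalise(patches_a) @ l2_normalise(patches_b).T
    best_b_for_a = sim.argmax(axis=1)  # best patch in B for each patch in A
    best_a_for_b = sim.argmax(axis=0)  # best patch in A for each patch in B
    idx_a = np.arange(sim.shape[0])
    mutual = best_a_for_b[best_b_for_a] == idx_a
    return np.stack([idx_a[mutual], best_b_for_a[mutual]], axis=1)


# Synthetic stand-ins for DINO features (assumption: 384-dimensional).
rng = np.random.default_rng(0)
demo_globals = rng.normal(size=(5, 384))
query_global = demo_globals[3] + 0.05 * rng.normal(size=384)

best_idx, sims = retrieve_demo(query_global, demo_globals)

# The query patches are a shuffled, slightly noisy copy of the demo patches.
demo_patches = rng.normal(size=(64, 384))
perm = rng.permutation(64)
query_patches = demo_patches[perm] + 0.05 * rng.normal(size=(64, 384))

matches = mutual_nearest_neighbours(query_patches, demo_patches)
```

In a real system the matched patch locations, together with depth, could feed a pose-estimation step that moves the end-effector into the demonstrated configuration. That step is omitted here because the abstract does not describe how the alignment is computed.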