The field of visual representation learning has seen explosive growth in recent years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific) robot action policies (e.g., via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect! In this work, we present a scalable alternative where the visual representations can help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g., via embedding dot product) that could be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and a dynamics predictor by fine-tuning a pre-trained representation on human-collected video sequences. The final method substantially outperforms traditional robot learning baselines (e.g., 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects without using any robot demonstrations at training time. For visualizations of the learned policies, please see https://agi-labs.github.io/manipulate-by-seeing/.
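To make the planning idea above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a distance function and a dynamics predictor over embeddings are already available, and it picks the candidate action whose predicted next embedding is closest to a goal embedding. The names `embedding_distance`, `toy_dynamics`, and `greedy_action` are hypothetical. The negative dot product stands in for the learned distance, and the additive dynamics model is a toy placeholder for a learned predictor.

```python
import numpy as np


def embedding_distance(z_a, z_b):
    """Assumed distance: negative dot product, so higher similarity means lower distance."""
    return -float(np.dot(z_a, z_b))


def toy_dynamics(z, action):
    """Hypothetical placeholder for a learned dynamics predictor.

    It simply shifts the embedding by the action vector.
    """
    return z + action


def greedy_action(z_current, z_goal, candidate_actions,
                  dynamics=toy_dynamics, distance=embedding_distance):
    """Return the index of the action whose predicted next embedding is
    closest to the goal embedding, together with all per-action costs."""
    costs = [distance(dynamics(z_current, a), z_goal) for a in candidate_actions]
    return int(np.argmin(costs)), costs


# Toy usage: 2-D embeddings and three candidate actions.
z_current = np.array([0.0, 0.0])
z_goal = np.array([1.0, 0.0])
actions = [np.array([0.1, 0.0]), np.array([-0.1, 0.0]), np.array([0.0, 0.1])]
best, costs = greedy_action(z_current, z_goal, actions)
print("selected action index:", best)
```

In this sketch the action is chosen greedily one step at a time. The abstract does not specify the planner, so a multi-step search over the same two learned components would be an equally valid reading.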