The field of visual representation learning has seen explosive growth in recent years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis for learning task-specific robot action policies (e.g. via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect. In this work, we present a scalable alternative where the visual representations can help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g. via embedding dot product) that could be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and dynamics predictor by fine-tuning a pre-trained representation on human-collected video sequences. The final method substantially outperforms traditional robot learning baselines (e.g. 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects, without using any robot demonstrations at training time. For visualizations of the learned policies, see: https://agi-labs.github.io/manipulate-by-seeing/
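To make the planning idea in the abstract concrete, the following is a minimal sketch of how a distance function and a dynamics predictor over embeddings could be combined to choose an action greedily. It is not the authors' implementation. The names `distance`, `predict_next`, and `greedy_action`, the linear dynamics matrix `W`, and the use of a negative normalized dot product as the distance are all assumptions made for illustration. In the method described above, both components would come from fine-tuning a pre-trained representation on human-collected video.

```python
import numpy as np


def distance(z_a, z_b, eps=1e-8):
    """Hypothetical distance: negative dot product of unit-normalized embeddings."""
    z_a = z_a / (np.linalg.norm(z_a) + eps)
    z_b = z_b / (np.linalg.norm(z_b) + eps)
    return -float(np.dot(z_a, z_b))


def predict_next(z, action, W):
    """Hypothetical dynamics predictor: a linear map of the action shifts the embedding."""
    return z + W @ action


def greedy_action(z_current, z_goal, candidate_actions, W):
    """Return the candidate whose predicted next embedding is closest to the goal."""
    costs = [
        distance(predict_next(z_current, a, W), z_goal)
        for a in candidate_actions
    ]
    return candidate_actions[int(np.argmin(costs))]


# Toy usage with 2-D "embeddings" standing in for encoder outputs (assumed values).
z_current = np.array([1.0, 0.0])
z_goal = np.array([0.0, 1.0])
W = np.eye(2)  # assumed dynamics parameters
candidates = [
    np.array([0.5, 0.0]),
    np.array([0.0, 0.5]),
    np.array([-0.5, 0.0]),
    np.array([0.0, -0.5]),
]
best = greedy_action(z_current, z_goal, candidates, W)
```

In this toy setup the selected action is the one that moves the predicted embedding toward the goal direction. A real system would replace the linear map with a learned predictor and obtain the embeddings from the fine-tuned vision encoder.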