Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose an approach for keypoint-based visual imitation (K-VIL) that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives needed for task execution. Our approach is capable of extracting such task representations from a single demonstration video, and of incrementally updating them when new demonstrations become available. To reproduce manipulation skills in novel scenes using the learned set of prioritized geometric constraints, we introduce a keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatch, new instances of categorical objects, and large object pose and shape variations, as well as its efficiency and robustness in both one-shot and few-shot imitation learning settings. Videos and source code are available at https://sites.google.com/view/k-vil.
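To make the idea of keypoint-based geometric constraints more concrete, the following is a minimal sketch of the linear special case only, and it is not taken from the K-VIL source code. It assumes that one keypoint's final position has been collected across several demonstrations and expressed in a candidate local frame, and that directions with near-zero variance across demonstrations can be read as constrained. The function name `classify_linear_constraint`, the threshold `var_threshold`, and the constraint labels are hypothetical choices for illustration; the principal manifolds described in the abstract may also be curved, which this sketch does not cover.

```python
import numpy as np


def classify_linear_constraint(points, var_threshold=1e-4):
    """Classify a keypoint constraint from its spread across demonstrations.

    points: (N, 3) array, one keypoint's final position in each of N
            demonstrations, expressed in a candidate local frame (assumed).
    var_threshold: hypothetical variance cutoff below which a principal
            direction is treated as constrained.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)

    # Sample covariance of the keypoint positions across demonstrations.
    cov = centered.T @ centered / max(len(points) - 1, 1)

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Count the principal directions along which the keypoint barely varies.
    n_constrained = int(np.sum(eigvals < var_threshold))

    labels = {
        0: "free",            # varies in all directions, no constraint
        1: "point-to-plane",  # one constrained direction
        2: "point-to-line",   # two constrained directions
        3: "point-to-point",  # fully constrained
    }
    return labels[n_constrained], eigvals, eigvecs


if __name__ == "__main__":
    # Keypoint positions that all lie in the plane z = 0.
    demo_points = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0.5, 0.2, 0]]
    label, eigvals, _ = classify_linear_constraint(demo_points)
    print(label)
```

In this sketch, a keypoint that lands anywhere on a table surface across demonstrations would be labeled as a point-to-plane constraint, while one that always lands at the same location would be labeled point-to-point. With a single demonstration the covariance is degenerate, so a one-shot setting would need additional assumptions that are outside the scope of this example.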