Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose an approach for keypoint-based visual imitation (K-VIL) that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives needed for task execution. Our approach can extract such task representations from a single demonstration video and incrementally update them as new demonstrations become available. To reproduce manipulation skills in previously unseen scenes using the learned set of prioritized geometric constraints, we introduce a novel keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatch, new instances of categorical objects, and large variations in object pose and shape, as well as its efficiency and robustness in both one-shot and few-shot imitation learning settings. Videos and source code are available at https://sites.google.com/view/k-vil.