How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (for example, placing a pan into the oven or a mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 real-world demonstrations. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.
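The abstract does not spell out how cross-object correspondences are turned into a pose. One standard way to do this, shown here only as an illustrative sketch and not as the authors' implementation, is least-squares rigid alignment (the Kabsch, or orthogonal Procrustes, solution via SVD). The sketch assumes the correspondences are already available as paired 3D points with optional confidence weights; the function name `estimate_rigid_transform` and all variable names are hypothetical.

```python
import numpy as np


def estimate_rigid_transform(src, dst, weights=None):
    """Least-squares rigid transform (R, t) with R @ src[i] + t ~= dst[i].

    Hypothetical helper: `src` and `dst` are (N, 3) arrays of corresponding
    points, and `weights` are optional non-negative per-pair confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one

    # Weighted centroids of both point sets.
    src_mean = w @ src
    dst_mean = w @ dst

    # Weighted 3x3 cross-covariance of the centered points.
    H = ((src - src_mean) * w[:, None]).T @ (dst - dst_mean)

    # Kabsch solution: rotation from the SVD of H, with a sign correction
    # so that the result is a proper rotation (det = +1), not a reflection.
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def demo():
    """Recover a known transform from synthetic, noise-free correspondences."""
    rng = np.random.default_rng(0)
    object_points = rng.normal(size=(50, 3))  # stand-in for e.g. mug points
    angle = np.pi / 3
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.1, -0.2, 0.3])
    target_points = object_points @ R_true.T + t_true
    R, t = estimate_rigid_transform(object_points, target_points)
    return bool(np.allclose(R, R_true) and np.allclose(t, t_true))
```

In a pipeline like the one described, the recovered `(R, t)` would play the role of the transform handed to the downstream motion planner. How the correspondences and their weights are predicted from visual input is the learned part and is outside the scope of this sketch.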