Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components of different object categories, have the potential to increase the generalization ability of a manipulation policy and enable cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object manipulation benchmark, PartManip, which is composed of 11 object categories, 494 objects, and 1432 tasks in 6 task classes. Compared to previous work, our benchmark is also more diverse and realistic: it has more objects and uses sparse-view point clouds as input, without oracle information such as part segmentation. To tackle the difficulties of vision-based policy learning, we first train a state-based expert with our proposed part-based canonicalization and part-aware rewards, and then distill its knowledge into a vision-based student. We also find that an expressive backbone is essential to handle the large diversity of objects. For cross-category generalization, we introduce domain adversarial learning for domain-invariant feature extraction. Extensive experiments in simulation show that our learned policy outperforms other methods by a large margin, especially on unseen object categories. We also demonstrate that our method can successfully manipulate novel objects in the real world.
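The abstract does not say how the domain adversarial objective is implemented. As an illustration only, the sketch below uses a gradient reversal step, one common way to realize domain adversarial training, on a scalar toy model. The function name, the scalar encoder weight `w`, the domain-classifier weight `v`, and the strength `lam` are all hypothetical and are not taken from PartManip. The domain classifier receives the ordinary gradient of the domain loss, while the encoder receives that gradient with its sign flipped, which pushes the features toward being uninformative about the domain.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_adversarial_grads(w, v, x, d, lam=1.0):
    """Toy one-sample gradients with gradient reversal (illustrative assumption).

    w   : scalar encoder weight, feature f = w * x
    v   : scalar domain-classifier weight, logit = v * f
    x   : scalar input
    d   : domain label, 0 or 1
    lam : strength of the reversed gradient sent to the encoder
    Returns (loss, grad_w, grad_v).
    """
    f = w * x                      # feature produced by the encoder
    p = sigmoid(v * f)             # predicted probability of domain 1
    eps = 1e-12                    # guards against log(0)
    loss = -(d * math.log(p + eps) + (1 - d) * math.log(1 - p + eps))

    dloss_dlogit = p - d           # derivative of cross-entropy w.r.t. the logit
    grad_v = dloss_dlogit * f      # classifier: ordinary gradient (minimize loss)
    grad_f = dloss_dlogit * v      # gradient flowing back into the feature
    grad_w = -lam * grad_f * x     # encoder: sign-flipped gradient (reversal)
    return loss, grad_w, grad_v


if __name__ == "__main__":
    loss, grad_w, grad_v = domain_adversarial_grads(w=0.5, v=2.0, x=1.0, d=1)
    print(loss, grad_w, grad_v)
```

Applying a gradient descent step with `grad_v` makes the classifier better at telling domains apart, while a descent step with `grad_w` moves the encoder in the direction that increases the domain loss. Setting `lam=0` switches the adversarial signal off for the encoder. In a full system, a task loss would be added to the encoder's update alongside this term.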