While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from solved because annotations are scarce. The problem becomes even more challenging for category-level 6D pose, which requires generalization to unseen instances. Current approaches are limited by their reliance on annotations generated in simulation or collected from humans. In this paper, we overcome this barrier by introducing a self-supervised learning approach trained directly on large-scale real-world object videos for category-level 6D pose estimation in the wild. Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometrical cycle-consistency losses that construct cycles across 2D-3D spaces, across different instances, and across different time steps. The learned correspondences can be applied to 6D pose estimation and to other downstream tasks such as keypoint transfer. Surprisingly, without any human annotations or simulators, our method achieves performance on par with or even better than previous supervised and semi-supervised methods on in-the-wild images. Project page: https://kywind.github.io/self-pose
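To give a rough sense of what a 2D-3D cycle over surface embeddings could look like, the sketch below maps each pixel to the canonical shape and back through soft nearest neighbours in embedding space, then penalizes pixels that do not return to their starting location. It is an illustration under stated assumptions, not the loss used in this work: the function name `cycle_consistency_loss`, the softmax-based matching, the squared-distance penalty, and the `temperature` parameter are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cycle_consistency_loss(pixel_emb, vertex_emb, pixel_xy, temperature=0.1):
    """Hypothetical 2D -> 3D -> 2D cycle loss (illustrative assumption only).

    pixel_emb:  (N, D) embeddings of N image pixels
    vertex_emb: (M, D) embeddings of M canonical-shape vertices
    pixel_xy:   (N, 2) image coordinates of the same N pixels
    """
    sim = pixel_emb @ vertex_emb.T / temperature  # (N, M) similarity scores
    pixel_to_vertex = softmax(sim, axis=1)        # soft match, 2D -> 3D
    vertex_to_pixel = softmax(sim.T, axis=1)      # soft match, 3D -> 2D
    cycle = pixel_to_vertex @ vertex_to_pixel     # (N, N), rows sum to 1
    returned_xy = cycle @ pixel_xy                # where each pixel lands after the cycle
    # Mean squared distance between the start and end of each cycle.
    return float(np.mean(np.sum((returned_xy - pixel_xy) ** 2, axis=1)))


if __name__ == "__main__":
    xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    # Perfectly matched embeddings: every pixel returns to itself.
    print(cycle_consistency_loss(np.eye(4), np.eye(4), xy, temperature=0.01))
    # Uninformative embeddings: every pixel returns to the centroid.
    print(cycle_consistency_loss(np.zeros((4, 4)), np.zeros((4, 4)), xy))
```

In this toy setting the loss is close to zero when pixel and vertex embeddings match one to one, and it grows when the embeddings carry no information, which is the behaviour a cycle-based training signal needs in order to work without pose labels.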