We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into a bag-of-words representation and can promptly propose a handful of similar-looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets, and can be seamlessly combined with an existing render-and-compare refinement method to achieve state-of-the-art RGB-only results. Project page: evinpinar.github.io/foundpose.
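To make the two core mechanisms concrete, below is a minimal, illustrative NumPy sketch of (a) bag-of-words template retrieval over patch descriptors and (b) kNN descriptor matching to form 2D-3D correspondences. This is not the authors' implementation: descriptor extraction from DINOv2, the visual-vocabulary construction, the PnP/RANSAC pose solver, and the featuremetric alignment step are all omitted, and every function name here is hypothetical.

```python
import numpy as np


def bow_histogram(desc, vocab):
    """Quantize patch descriptors against a visual vocabulary and return
    an L2-normalized word histogram (a bag-of-words representation)."""
    # Squared distances from each descriptor (N, D) to each word (V, D).
    d2 = ((desc[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


def retrieve_templates(query_desc, template_descs, vocab, k=3):
    """Rank pre-rendered templates by cosine similarity of their
    bag-of-words histograms to the query image's histogram."""
    q = bow_histogram(query_desc, vocab)
    sims = np.array([q @ bow_histogram(t, vocab) for t in template_descs])
    return np.argsort(-sims)[:k]  # indices of the k most similar templates


def knn_correspondences(query_desc, query_xy, tmpl_desc, tmpl_xyz):
    """For each query patch, find the nearest template patch in descriptor
    space; pairing the patch's 2D location with the template patch's known
    3D point yields a 2D-3D correspondence for a PnP solver."""
    d2 = ((query_desc[:, None, :] - tmpl_desc[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)
    return query_xy, tmpl_xyz[nn]
```

In the actual method, the correspondences produced this way are only as accurate as the coarse patch grid, which is why the abstract's featuremetric alignment step is needed to refine them.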