We propose FoundPose, a method for 6D pose estimation of unseen rigid objects from a single RGB image. The method assumes that 3D models of the objects are available but does not require any object-specific training. This is achieved by building upon DINOv2, a recent vision foundation model with impressive generalization capabilities. The online pose estimation stage is supported by a minimal object representation that is built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first rapidly retrieves a handful of similar-looking templates using a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method can handle diverse objects, including challenging ones with symmetries and without any texture, and noticeably outperforms existing RGB methods for coarse pose estimation in both accuracy and speed on the standard BOP benchmark. With the featuremetric refinement and an additional MegaPose refinement, which are shown to be complementary, the method outperforms all RGB competitors. Source code is available at: evinpinar.github.io/foundpose.
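The bag-of-words template retrieval step can be illustrated with a minimal sketch. It assumes that patch features have already been extracted (plain lists of floats stand in for DINOv2 patch descriptors) and that a codebook of visual words is given, for example from k-means over template features. The function names, the plain term-frequency histogram, and the cosine scoring are illustrative assumptions and are not taken from the FoundPose implementation.

```python
import math


def nearest_word(feature, codebook):
    """Index of the codebook centroid closest (squared L2) to one patch feature."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, codebook[k])),
    )


def bow_descriptor(features, codebook):
    """L2-normalized histogram of visual-word counts for one image.

    Assumption: plain term frequencies; a real system may weight words
    (e.g. tf-idf) before normalizing.
    """
    hist = [0.0] * len(codebook)
    for feature in features:
        hist[nearest_word(feature, codebook)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist


def retrieve_templates(query_features, template_descriptors, codebook, k=5):
    """Return (template_index, cosine_score) pairs for the k best templates.

    template_descriptors are precomputed with bow_descriptor during onboarding,
    so only the query descriptor is built at inference time.
    """
    q = bow_descriptor(query_features, codebook)
    # Descriptors are unit-length, so the dot product is the cosine similarity.
    scores = [sum(a * b for a, b in zip(q, d)) for d in template_descriptors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]
```

In this sketch the template descriptors are computed once per object, and the query is compared against all of them with one dot product each, which is why retrieval of this kind can be fast. The retrieved templates would then be passed on to patch-level matching to form 2D-3D correspondences.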