Zero-shot 6D object pose estimation requires detecting novel objects and estimating their 6D poses in cluttered scenes, which poses significant challenges for model generalizability. The recent Segment Anything Model (SAM) has shown remarkable zero-shot transfer performance, offering a promising route to this task. Motivated by this, we introduce SAM-6D, a novel framework that realizes the task in two steps: instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves the valid ones through carefully designed object matching scores in terms of semantics, appearance, and geometry. Treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process, featuring a novel design of background tokens, to construct dense 3D-3D correspondences, which ultimately yield the pose estimates. Without bells and whistles, SAM-6D outperforms existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.
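To make the last step of the pipeline concrete, the following is a minimal sketch of how a pose could be recovered from 3D-3D correspondences once points matched to "background" have been discarded. It is not the authors' implementation: the abstract does not say how PEM turns correspondences into a pose, so the weighted Kabsch (SVD) solver here is a standard technique chosen for illustration. The function names, the assignment-matrix layout (row 0 and column 0 reserved for the background token), and the use of assignment values as weights are all assumptions.

```python
import numpy as np


def correspondences_from_assignment(assign):
    """Pick one model point per observed point, dropping background matches.

    assign: (N + 1, M + 1) soft assignment matrix (hypothetical layout).
            Row 0 / column 0 stand for the background token; row i + 1 is
            observed point i and column j + 1 is model point j.
    Returns indices into the observed points, indices into the model points,
    and one confidence weight per kept pair.
    """
    rows = assign[1:]                      # drop the background row
    best = rows.argmax(axis=1)             # best column for each observed point
    obs_idx = np.nonzero(best > 0)[0]      # column 0 means "no counterpart"
    model_idx = best[obs_idx] - 1          # shift back to model point indices
    weights = rows[obs_idx, model_idx + 1]
    return obs_idx, model_idx, weights


def weighted_kabsch(src, dst, w):
    """Rigid transform (R, t) minimising sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (K, 3) corresponding points; w: (K,) non-negative weights.
    """
    w = w / w.sum()
    mu_src = (w[:, None] * src).sum(axis=0)
    mu_dst = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance between the centred point sets.
    H = (src - mu_src).T @ (w[:, None] * (dst - mu_dst))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t
```

In this sketch the observed (partial) scene points play the role of `dst` and the matched model points the role of `src`, so the returned `R` and `t` map the object model into the camera frame. Reserving a background slot lets clutter and occluded regions be matched to "nothing" rather than being forced onto a wrong counterpart, which is the intuition the abstract gives for partial-to-partial matching.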