Zero-shot 6D object pose estimation requires detecting novel objects and estimating their 6D poses in cluttered scenes, which poses significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has demonstrated remarkable zero-shot transfer performance, offering a promising way to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework that realizes the task in two steps: instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves the valid ones through meticulously crafted object matching scores in terms of semantics, appearance, and geometry. Treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process, featuring a novel design of background tokens, to construct dense 3D-3D correspondences, which ultimately yield the pose estimates. Without bells and whistles, SAM-6D outperforms existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.
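To make the partial-to-partial matching idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes per-point features are already available for the observed points and the model points, prepends one background token to each set so that points without a counterpart can be assigned to it, builds a soft assignment with a dual softmax (an assumption made here for illustration), and recovers a rigid pose from the resulting 3D-3D correspondences with the standard weighted Kabsch (SVD) solution. All function names, the `temperature` parameter, and the weighting scheme are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_assignment(feat_obs, feat_model, bg_token, temperature=0.05):
    """Soft assignment matrix of shape (1+N, 1+M).

    Row 0 and column 0 belong to the background token (assumed design):
    a point with no counterpart in the other set should match index 0.
    """
    f_obs = np.vstack([bg_token, feat_obs])      # (1+N, C)
    f_model = np.vstack([bg_token, feat_model])  # (1+M, C)
    logits = f_obs @ f_model.T / temperature
    # Dual softmax: a pair scores high only if each side prefers the other.
    return softmax(logits, axis=1) * softmax(logits, axis=0)


def weighted_kabsch(src, dst, w):
    """Return R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = w / w.sum()
    mu_src = (w[:, None] * src).sum(axis=0)
    mu_dst = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance between the centered point sets.
    H = (src - mu_src).T @ (w[:, None] * (dst - mu_dst))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t


def pose_from_assignment(A, pts_obs, pts_model):
    """Estimate the model-to-observation pose from a soft assignment."""
    inner = A[1:, 1:]                      # drop the background row/column
    best = inner.argmax(axis=1)            # best model point per observed point
    w = inner[np.arange(len(best)), best]  # matching score used as weight
    # Observed points that prefer the background token get zero weight.
    prefers_background = A[1:].argmax(axis=1) == 0
    w = np.where(prefers_background, 0.0, w)
    return weighted_kabsch(pts_model[best], pts_obs, w)
```

In this sketch the returned `R` and `t` map model coordinates into the observed (camera) frame, i.e. `pts_obs ≈ pts_model @ R.T + t` for the matched points. A real system would obtain the features from a learned network and would typically refine a coarse estimate with a second, finer matching stage.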