Accurate estimation of six degrees-of-freedom (6DoF) object poses is essential for many applications in robotics and augmented reality. However, existing methods for 6DoF pose estimation often depend on CAD templates or dense support views, which restricts their usefulness in real-world situations. In this study, we present Cas6D, a new cascade framework for generalizable few-shot 6DoF pose estimation that uses only RGB images. To address false positives in target object detection under the extreme few-shot setting, our framework uses a self-supervised pre-trained ViT to learn robust feature representations. We then initialize the top-K nearest pose candidates based on similarity score and refine these initial poses using feature pyramids to formulate and update a cascade warped feature volume, which encodes context at increasingly finer scales. By discretizing the pose search range into multiple pose bins and progressively narrowing that range at each stage using the predictions from the previous stage, Cas6D can overcome the large gap between pose candidates and ground-truth poses, which is a common failure mode in sparse-view scenarios. Experimental results on the LINEMOD and GenMOP datasets demonstrate that, under the 32-shot setting, Cas6D outperforms the state-of-the-art methods OnePose++ and Gen6D by 9.2% and 3.8% in Proj-5 accuracy.
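The two ideas summarized above, top-K candidate selection by similarity score and coarse-to-fine narrowing of a binned search range, can be illustrated with a deliberately simplified sketch. It is not the Cas6D implementation: it searches a single scalar pose parameter (an angle in degrees) rather than a full 6DoF pose, it assumes cosine similarity as the similarity score, and it replaces the learned feature volume with a caller-supplied `score_fn`. The names `top_k_candidates`, `cascade_refine`, `score_fn`, `num_bins`, and `shrink` are hypothetical and introduced only for illustration.

```python
import numpy as np


def top_k_candidates(query_feat, support_feats, k):
    """Return indices and scores of the k support views most similar to the query.

    Assumption: similarity is cosine similarity between feature vectors.
    """
    q = query_feat / np.linalg.norm(query_feat)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    scores = s @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


def cascade_refine(score_fn, init, half_range, num_bins=5, num_stages=3, shrink=0.5):
    """Coarse-to-fine search over one scalar pose parameter.

    Each stage discretizes [center - half_range, center + half_range] into
    num_bins bin centres, keeps the best-scoring bin, and then shrinks the
    search range around it for the next stage.
    """
    center = float(init)
    for _ in range(num_stages):
        bins = np.linspace(center - half_range, center + half_range, num_bins)
        scores = np.array([score_fn(b) for b in bins])
        center = float(bins[np.argmax(scores)])
        half_range *= shrink  # narrow the range using this stage's prediction
    return center


# Toy setup (hypothetical): four support views spaced 90 degrees apart,
# and a query whose true angle is 37 degrees.
support_angles = np.array([0.0, 90.0, 180.0, 270.0])
support_feats = np.stack(
    [np.cos(np.radians(support_angles)), np.sin(np.radians(support_angles))], axis=1
)
true_angle = 37.0
query_feat = np.array([np.cos(np.radians(true_angle)), np.sin(np.radians(true_angle))])

idx, sims = top_k_candidates(query_feat, support_feats, k=2)
init_angle = support_angles[idx[0]]

# Stand-in for a learned scoring network: higher is better near the true angle.
refined = cascade_refine(
    lambda a: -abs(a - true_angle), init_angle, half_range=45.0, num_stages=4
)
print(init_angle, refined)
```

With sparse support views the nearest candidate can lie far from the true pose (here 37 degrees away), so a single fine-grained search around it would need many bins. Halving the range at each stage keeps the number of evaluated bins per stage fixed while the bin spacing shrinks geometrically.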