As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a growing demand to recognize diverse objects. State-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel object pose estimation methods address this issue using task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures. MegaPose, for example, is trained on a dataset consisting of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming, we introduce ZS6D, a method for zero-shot novel object 6D pose estimation. Visual descriptors, extracted using pre-trained Vision Transformers (ViT), are used for matching rendered templates against query images of objects and for establishing local correspondences. These local correspondences enable deriving geometric correspondences and are used for estimating the object's 6D pose with RANSAC-based PnP. This approach shows that the image descriptors extracted by pre-trained ViTs are well suited to achieving a notable improvement over two state-of-the-art novel object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS. Compared to one of the two methods, we improve the Average Recall on all three datasets; compared to the second method, we improve on two datasets.
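The matching stage described in the abstract (choose the rendered template closest to the query, then establish local correspondences between the two) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the ZS6D implementation: descriptors are plain lists of floats standing in for ViT features, cosine similarity is an assumed similarity measure, the mutual nearest-neighbour check is an assumed matching rule, and the function names `best_template` and `mutual_nearest_neighbours` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero descriptor as matching nothing
    return dot / (norm_a * norm_b)


def best_template(query_desc, template_descs):
    """Return (index, score) of the template descriptor most similar to the query.

    Assumption: each image is summarised by one global descriptor.
    """
    scores = [cosine(query_desc, t) for t in template_descs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def mutual_nearest_neighbours(query_patches, template_patches):
    """Return pairs (i, j) where query patch i and template patch j
    are each other's most similar patch.

    Assumption: local correspondences are taken as mutual nearest
    neighbours between per-patch descriptors.
    """
    sim = [[cosine(q, t) for t in template_patches] for q in query_patches]
    # Best template patch for every query patch.
    q_to_t = [max(range(len(row)), key=row.__getitem__) for row in sim]
    # Best query patch for every template patch.
    t_to_q = [
        max(range(len(query_patches)), key=lambda i: sim[i][j])
        for j in range(len(template_patches))
    ]
    return [(i, j) for i, j in enumerate(q_to_t) if t_to_q[j] == i]


if __name__ == "__main__":
    templates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    index, score = best_template([0.1, 0.9], templates)
    print("best template:", index)

    pairs = mutual_nearest_neighbours(
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    )
    print("local correspondences:", pairs)
```

In a full pipeline, each matched template patch would be associated with a 3D point on the object model (known because the template is rendered), giving 2D-3D correspondences that a RANSAC-based PnP solver such as OpenCV's `cv2.solvePnPRansac` could turn into a 6D pose. That step is omitted here to keep the sketch dependency-free.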