Estimating the 6D pose of objects unseen during training is highly desirable yet challenging. Zero-shot object 6D pose estimation methods address this challenge by leveraging additional task-specific supervision provided by large-scale, photo-realistic synthetic datasets. However, their performance heavily depends on the quality and diversity of the rendered data, and they require extensive training. In this work, we show how to tackle the same task without training on task-specific data. We propose FreeZe, a novel solution that harnesses the capabilities of pre-trained geometric and vision foundation models. FreeZe combines 3D geometric descriptors learned from unrelated 3D point clouds with 2D visual features learned from web-scale 2D images to generate discriminative 3D point-level descriptors. We then estimate the 6D pose of unseen objects via RANSAC-based 3D registration. We also introduce a novel algorithm, driven by visual features, to resolve ambiguous cases caused by geometrically symmetric objects. We comprehensively evaluate FreeZe across the seven core datasets of the BOP Benchmark, which include over a hundred 3D objects and 20,000 images captured in various scenarios. FreeZe consistently outperforms all state-of-the-art approaches, including competitors extensively trained on synthetic 6D pose estimation data. Code will be publicly available at https://andreacaraffa.github.io/freeze.
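The pipeline described above, fusing per-point geometric and visual descriptors and then recovering a rigid 6D pose via RANSAC-based registration, can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the paper's implementation: the feature backbones are abstracted away (the descriptors are just arrays here), and all function names, dimensions, and thresholds are illustrative assumptions.

```python
import numpy as np

def fuse_descriptors(geom_feats, vis_feats):
    # L2-normalize each modality separately, then concatenate, so that
    # geometric and visual cues contribute on a comparable scale.
    # (Stand-in for descriptors from frozen geometric/vision foundation models.)
    g = geom_feats / np.linalg.norm(geom_feats, axis=1, keepdims=True)
    v = vis_feats / np.linalg.norm(vis_feats, axis=1, keepdims=True)
    return np.concatenate([g, v], axis=1)

def match(desc_a, desc_b):
    # Nearest-neighbour correspondences in fused descriptor space.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def kabsch(P, Q):
    # Best-fit rotation R and translation t with R @ p + t ~= q (Kabsch).
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def ransac_pose(src, dst, src_desc, dst_desc, iters=200, thresh=0.01, seed=0):
    # Hypothesize poses from minimal 3-point samples of the putative
    # correspondences; keep the hypothesis with the most inliers.
    rng = np.random.default_rng(seed)
    corr = match(src_desc, dst_desc)
    best, best_inliers = (np.eye(3), np.zeros(3)), -1
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = kabsch(src[idx], dst[corr[idx]])
        err = np.linalg.norm((src @ R.T + t) - dst[corr], axis=1)
        n_inliers = int((err < thresh).sum())
        if n_inliers > best_inliers:
            best_inliers, best = n_inliers, (R, t)
    return best  # (R, t): the estimated 6D pose
```

In practice one would replace the brute-force nearest-neighbour matching with a KD-tree or GPU search and refine the RANSAC estimate with ICP; the sketch only shows the core registration logic.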