OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the ever-changing object sets found in such contexts, modern zero-shot object pose estimators require no object-specific training and rely only on CAD models. Such models are hard to obtain after deployment, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of the database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation when the exact instance is not available, and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be used for pose estimation with Megapose, achieving better results than a reconstruction-based approach.
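The two-stage retrieval described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are assumed to be precomputed (CLIP-space vectors for the query and the database captions, DINOv2-style vectors for the Region-of-Interest and the rendered views), and the function name `two_stage_retrieve`, the `top_k` cutoff, cosine similarity as the score, and max-over-views aggregation are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length (guarding against zero norm)."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, 1e-12)


def two_stage_retrieve(query_text_emb, query_image_emb,
                       caption_embs, view_embs, top_k=3):
    """Return (best_model_index, candidate_indices).

    Assumed shapes (hypothetical layout, not taken from the paper):
      query_text_emb:  (d_text,)   CLIP-space embedding of the query
      query_image_emb: (d_img,)    image embedding of the detected ROI
      caption_embs:    (n_models, n_views, d_text) caption embeddings per view
      view_embs:       (n_models, n_views, d_img)  image embeddings per view
    """
    view_embs = np.asarray(view_embs, dtype=float)
    q_text = l2_normalize(query_text_emb)
    q_img = l2_normalize(query_image_emb)

    # Stage 1: text-based filtering. Cosine similarity between the query and
    # every caption; each model is scored by its best-matching view.
    text_sim = l2_normalize(caption_embs) @ q_text      # (n_models, n_views)
    text_score = text_sim.max(axis=1)                   # (n_models,)
    k = min(top_k, text_score.shape[0])
    candidates = np.argsort(-text_score)[:k]

    # Stage 2: image-based refinement, restricted to the stage-1 candidates.
    img_sim = l2_normalize(view_embs[candidates]) @ q_img  # (k, n_views)
    img_score = img_sim.max(axis=1)
    best = int(candidates[int(np.argmax(img_score))])
    return best, [int(c) for c in candidates]
```

Under these assumptions, a model whose renderings look very similar to the query but whose captions do not match is never considered in the second stage, which is the purpose of filtering by text before comparing images.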
