Generalisation to unseen objects in 6D pose estimation is very challenging. While Vision-Language Models (VLMs) enable the use of natural-language descriptions to support 6D pose estimation of unseen objects, these solutions underperform model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object described only by a textual prompt. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features, which we use to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.