We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., a CAD model or a video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the two scenes and to estimate its relative 6D pose. The key to our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that generalizes to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in 4,000 image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects across different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon.