This work presents OVIR-3D, a straightforward yet effective method for open-vocabulary 3D object instance retrieval that requires no 3D data for training. Given a language query, the method returns a ranked set of 3D object instance segments, ordered by the feature similarity between each instance and the text query. This is achieved by multi-view fusion of text-aligned 2D region proposals into 3D space, where the 2D region proposal network can leverage 2D datasets, which are more accessible and typically larger than 3D datasets. The fusion process is efficient: it runs in real time for most indoor 3D scenes and requires no additional training in 3D space. Experiments on public datasets and a real robot demonstrate the effectiveness of the method and its potential for applications in robot navigation and manipulation.
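The retrieval step described above, ranking 3D instances by feature similarity to a text query, can be illustrated with a minimal sketch. It assumes each fused 3D instance already carries one text-aligned feature vector and that the query has been embedded into the same space. The function name `rank_instances`, the use of cosine similarity, and the toy vectors are illustrative assumptions; the abstract does not specify the similarity measure or the interface.

```python
import numpy as np


def rank_instances(instance_feats, text_feat, top_k=None):
    """Rank instances by cosine similarity to a text-query embedding.

    instance_feats: (N, D) array, one feature vector per 3D instance
                    (hypothetical output of the multi-view fusion step).
    text_feat:      (D,) array, embedding of the language query.
    Returns (indices, scores), best match first.
    """
    inst = np.asarray(instance_feats, dtype=np.float64)
    query = np.asarray(text_feat, dtype=np.float64)

    # L2-normalise both sides so the dot product equals cosine similarity.
    inst_norms = np.linalg.norm(inst, axis=1, keepdims=True)
    inst_unit = inst / np.clip(inst_norms, 1e-12, None)
    query_unit = query / max(np.linalg.norm(query), 1e-12)

    scores = inst_unit @ query_unit

    # Sort in descending order of similarity; a stable sort keeps ties
    # in their original instance order.
    order = np.argsort(-scores, kind="stable")
    if top_k is not None:
        order = order[:top_k]
    return order, scores[order]


# Toy example: three instances with 3-dimensional features.
feats = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.0, 1.0, 0.0])
order, scores = rank_instances(feats, query)
```

In this toy case the second instance is the only one with a component along the query direction, so it is ranked first. In a real pipeline the feature vectors would come from a text-aligned vision-language model rather than being written by hand.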