We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically recognize only object categories from a pre-defined closed set of classes annotated in the training datasets. This is an important limitation for real-world applications, where one might need to perform tasks guided by novel, open-vocabulary queries about a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation and propose OpenMask3D, a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
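The per-mask aggregation and open-vocabulary querying described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the abstract: the fusion operator is taken to be a plain average of L2-normalized per-view embeddings, the short hand-written vectors stand in for real CLIP image and text embeddings, and the function names (`fuse_mask_feature`, `rank_masks`) are hypothetical. It is not the authors' implementation.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm (unchanged if it is the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def fuse_mask_feature(view_embeddings):
    """Fuse the embeddings of one mask seen from several views.

    Assumption: fusion is a simple mean of unit-normalized view embeddings,
    re-normalized afterwards. The abstract does not specify the operator.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    unit = [normalize(v) for v in view_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_masks(mask_features, query_embedding):
    """Return mask ids sorted by similarity to the query, best match first."""
    scores = {m: cosine(f, query_embedding) for m, f in mask_features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for per-view image embeddings of two class-agnostic masks.
views_per_mask = {
    "mask_0": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "mask_1": [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]],
}
mask_features = {m: fuse_mask_feature(v) for m, v in views_per_mask.items()}

# Toy stand-in for the text embedding of a free-form query.
query = [0.0, 1.0, 0.0]
ranking = rank_masks(mask_features, query)
print(ranking)
```

Because each instance mask carries a single fused feature, a text query selects whole instances rather than individual points, which is what distinguishes this setup from the per-point features used for semantic segmentation.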