We introduce the task of open-vocabulary 3D instance segmentation. Traditional approaches to 3D instance segmentation largely rely on existing 3D annotated datasets, which are restricted to a closed set of object categories. This is an important limitation for real-life applications, where one might need to perform tasks guided by novel, open-vocabulary queries about a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods are limited in their ability to identify object instances. In this work, we address this limitation and propose OpenMask3D, a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. We conduct experiments and ablation studies on the ScanNet200 dataset to evaluate the performance of OpenMask3D and provide insights into the open-vocabulary 3D instance segmentation task. We show that our approach outperforms other open-vocabulary counterparts, particularly on the long-tail distribution. Furthermore, OpenMask3D goes beyond the limitations of closed-vocabulary approaches and enables the segmentation of object instances based on free-form queries describing object properties such as semantics, geometry, affordances, and material properties.
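To make the per-mask aggregation and querying idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that each predicted instance mask already comes with one image embedding per view in which it is visible, and that a text query has been embedded into the same space. The abstract does not say how the views are fused, so averaging unit-normalized embeddings is an assumption made here for illustration, and the function names `aggregate_mask_feature` and `rank_masks` are hypothetical. The toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def aggregate_mask_feature(view_embeddings):
    """Fuse the image embeddings of one mask across views.

    Assumption: fusion is a plain mean of unit-normalized per-view
    embeddings; the abstract does not specify the actual scheme.
    """
    views = np.asarray(view_embeddings, dtype=np.float64)
    if views.ndim != 2 or views.shape[0] == 0:
        raise ValueError("expected a non-empty (num_views, dim) array")
    fused = l2_normalize(views).mean(axis=0)
    return l2_normalize(fused)


def rank_masks(mask_features, query_embedding):
    """Rank masks by cosine similarity to a query embedding.

    Returns (order, scores): mask indices sorted from best to worst match,
    and the similarity score of every mask in its original position.
    """
    feats = l2_normalize(np.asarray(mask_features, dtype=np.float64))
    query = l2_normalize(np.asarray(query_embedding, dtype=np.float64))
    scores = feats @ query
    order = np.argsort(-scores)
    return order, scores


# Toy stand-ins for real embeddings: two masks, each seen in two views.
mask_features = np.stack([
    aggregate_mask_feature([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    aggregate_mask_feature([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
])
order, scores = rank_masks(mask_features, [0.0, 1.0, 0.0])
print(order, scores)
```

Because one feature vector is kept per mask rather than per point, a query returns whole object instances ranked by similarity, which is the distinction the abstract draws from point-wise open-vocabulary features.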