Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering that accounts for both mask coherence and semantic coherence, estimated from the 2D object instance masks. We evaluate our method on ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available. Project page: https://gfmei.github.io/PoVo
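The superpoint merging step described above can be illustrated with a minimal sketch. The affinity construction below is an assumption for illustration only (the paper's exact coherence terms and weighting are not specified here): we build a toy symmetric affinity matrix combining a "mask coherence" term (how often two superpoints co-occur in the same 2D instance mask) and a "semantic coherence" term (similarity of per-superpoint semantic embeddings), then cluster it with off-the-shelf spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy setup: merge N superpoints into instances via spectral clustering
# on a combined affinity. All values here are synthetic placeholders.
rng = np.random.default_rng(0)
n_superpoints = 8

# mask_coherence[i, j]: e.g., fraction of 2D instance masks in which
# superpoints i and j are covered by the same mask (toy values).
mask_coherence = rng.uniform(0.0, 1.0, (n_superpoints, n_superpoints))
mask_coherence = (mask_coherence + mask_coherence.T) / 2.0

# semantic_coherence[i, j]: cosine similarity of per-superpoint semantic
# embeddings, rescaled from [-1, 1] to [0, 1] (toy embeddings).
feats = rng.normal(size=(n_superpoints, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
semantic_coherence = (feats @ feats.T + 1.0) / 2.0

# Equal weighting of the two terms is an assumption of this sketch.
affinity = 0.5 * mask_coherence + 0.5 * semantic_coherence
np.fill_diagonal(affinity, 1.0)

# Spectral clustering on the precomputed affinity yields one instance
# id per superpoint; superpoints sharing an id form one 3D instance.
clustering = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit(affinity)
instance_ids = clustering.labels_
print(instance_ids)
```

In practice the number of instances would not be fixed a priori; methods typically estimate it from the affinity spectrum (e.g., an eigengap heuristic) rather than hard-coding `n_clusters` as done here.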