In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely only on annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate this poor generalizability by generating class-agnostic masks or by projecting generalized masks from 2D to 3D, but they disregard semantic or geometric information, leading to sub-optimal performance. Instead, generating generalizable yet semantic-related masks directly from 3D point clouds would yield superior outcomes. To this end, we introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometry-aware visual-language learning framework that achieves strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network that incorporates multimodal semantics into both the backbone and the decoder. In addition, to align the 3D segmentation model with various language instructions and to enhance mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on the ScanNetv2, ScanNet200, and Replica benchmarks, and its results even approach those of the fully-supervised counterpart despite the absence of class annotations during training. Furthermore, extensive qualitative results demonstrate the versatility of SOLE in responding to diverse language instructions.
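To make the fusion idea concrete, the following is a minimal PyTorch sketch of how multimodal semantics could be injected into point features via cross-attention. All names, dimensions, and the choice of `nn.MultiheadAttention` are illustrative assumptions on our part, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionBlock(nn.Module):
    """Hypothetical sketch (not SOLE's code): fuse per-point backbone
    features with CLIP-style text/vision embeddings via cross-attention."""

    def __init__(self, point_dim: int = 96, clip_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Project multimodal (CLIP-style) embeddings into the point-feature space.
        self.proj = nn.Linear(clip_dim, point_dim)
        self.cross_attn = nn.MultiheadAttention(point_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(point_dim)

    def forward(self, point_feats: torch.Tensor, clip_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N, point_dim), features for N points from the 3D backbone
        # clip_feats:  (B, M, clip_dim), M multimodal embeddings for the scene
        kv = self.proj(clip_feats)
        attended, _ = self.cross_attn(query=point_feats, key=kv, value=kv)
        return self.norm(point_feats + attended)  # residual fusion

# Toy usage: one scene, 1024 points, 4 multimodal embeddings.
block = MultimodalFusionBlock()
fused = block(torch.randn(1, 1024, 96), torch.randn(1, 4, 512))
print(fused.shape)  # torch.Size([1, 1024, 96])
```

A block of this kind could be stacked at several backbone resolutions and reused in the decoder, consistent with the abstract's description of incorporating multimodal semantics in both places.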