3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with images, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose utilizing hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes. Code is available at https://github.com/hithqd/UniM-OV3D.