Existing 3D scene understanding methods achieve high performance on closed-set benchmarks but fail to handle novel categories in real-world applications. To this end, we propose a Regional Point-Language Contrastive learning framework, namely RegionPLC, for open-world 3D scene understanding, which equips models trained on closed-set datasets with open-vocabulary recognition capabilities. We propose dense visual prompts to elicit region-level visual-language knowledge from 2D foundation models via captioning, which in turn allows us to build dense regional point-language associations. We then design a point-discriminative contrastive learning objective that enables point-independent learning from captions for dense scene understanding. We conduct extensive experiments on the ScanNet, ScanNet200, and nuScenes datasets. RegionPLC significantly outperforms previous base-annotated 3D open-world scene understanding approaches by an average of 11.6\% and 6.6\% for semantic and instance segmentation, respectively. It also shows promising open-world results in the absence of any human annotation, with low training and inference costs. Code will be released.
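To make the idea of point-independent learning from captions concrete, the following is a minimal sketch of a generic per-point contrastive loss. It is not the RegionPLC objective itself, whose exact form the abstract does not give. It assumes each point has already been associated with one regional caption and scores every point separately against all caption embeddings, rather than pooling points into a single region feature first. The function name, the argument names, the temperature value of 0.07, and the use of `-1` to mark points without a caption are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def point_caption_contrastive_loss(point_feats, caption_embs,
                                   point_to_caption, temperature=0.07):
    """Illustrative per-point contrastive loss (hypothetical helper).

    point_feats:      list of N point feature vectors (length D each)
    caption_embs:     list of K caption embedding vectors (length D each)
    point_to_caption: list of N ints; index of the caption paired with
                      each point, or -1 if the point has no caption
    """
    captions = [_normalize(c) for c in caption_embs]
    total, count = 0.0, 0
    for feat, target in zip(point_feats, point_to_caption):
        if target < 0:
            continue  # point is not covered by any captioned region
        p = _normalize(feat)
        # Cosine similarity between this point and every caption.
        logits = [sum(a * b for a, b in zip(p, c)) / temperature
                  for c in captions]
        # Numerically stable log-sum-exp over the captions.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the paired caption as the positive.
        total += log_z - logits[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    print(point_caption_contrastive_loss(points, captions, [0, 1]))
    print(point_caption_contrastive_loss(points, captions, [1, 0]))
```

In this toy example, the first call pairs each point with the caption it is aligned with and should give a loss near zero. The second call swaps the pairing and should give a much larger loss. Because every point contributes its own term, points in the same region can receive different gradients, which is the property the sketch is meant to illustrate.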