Training a 3D scene understanding model requires complicated human annotations, which are laborious to collect and yield a model that encodes only closed-set object semantics. In contrast, vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties. To this end, we propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision. We first modify CLIP's input and forward pass so that it can extract dense pixel features for 3D scene contents. We then project multi-view image features onto the point cloud and train a 3D scene understanding model with feature distillation. Without any annotations or additional training, our model achieves promising annotation-free semantic segmentation results on open-vocabulary semantics and long-tailed concepts. In addition, serving as a cross-modal pre-training framework, our method can improve data efficiency during fine-tuning. Our model outperforms previous state-of-the-art methods on various zero-shot and data-efficient learning benchmarks. Most importantly, our model successfully inherits CLIP's richly structured knowledge, allowing 3D scene understanding models to recognize not only object concepts but also open-world semantics.
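The pipeline described in the abstract (project multi-view pixel features onto the point cloud, distill them into a 3D model, then segment without labels) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the pinhole camera model, the per-view dictionary keys (`feat`, `K`, `world_to_cam`), averaging features across views, the cosine-distance distillation loss, and the text-similarity labeling step are all hypothetical choices. Occlusion handling (e.g., a depth check) is omitted for brevity.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero when normalizing


def project_points(points, K, world_to_cam, hw):
    """Project N world-space points into one view (pinhole model, assumed).

    Returns integer pixel coordinates (u, v) and a mask of points that fall
    inside the image and in front of the camera. No occlusion test is done.
    """
    h, w = hw
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                              # (N, 3)
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    uv = (K @ cam.T).T                                                  # (N, 3)
    u = np.round(uv[:, 0] / safe_z).astype(int)
    v = np.round(uv[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return u, v, valid


def lift_features(points, views):
    """Average dense pixel features over every view in which a point lands."""
    dim = views[0]["feat"].shape[-1]
    acc = np.zeros((len(points), dim))
    cnt = np.zeros(len(points))
    for view in views:
        h, w, _ = view["feat"].shape
        u, v, valid = project_points(points, view["K"], view["world_to_cam"], (h, w))
        acc[valid] += view["feat"][v[valid], u[valid]]
        cnt[valid] += 1
    seen = cnt > 0
    acc[seen] /= cnt[seen, None]
    return acc, seen


def _normalize(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + EPS)


def cosine_distill_loss(student_feat, target_feat, seen):
    """Mean cosine distance between 3D-model features and lifted 2D features,
    computed only on points that received a 2D feature."""
    s = _normalize(student_feat[seen])
    t = _normalize(target_feat[seen])
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


def zero_shot_labels(point_feat, text_emb):
    """Assign each point the class whose text embedding is most similar."""
    return np.argmax(_normalize(point_feat) @ _normalize(text_emb).T, axis=1)
```

In a real setting, `feat` would hold dense features from the adapted image encoder and `text_emb` would hold text-encoder embeddings of the class names; a 3D network trained to minimize `cosine_distill_loss` could then be queried with `zero_shot_labels` without any annotated points.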