CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

Training a 3D scene understanding model requires complicated human annotations, which are laborious to collect and result in a model that encodes only closed-set object semantics. In contrast, vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties. To this end, we propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision. We first modify CLIP's input and forwarding process so that it can be adapted to extract dense pixel features for 3D scene contents. We then project multi-view image features to the point cloud and train a 3D scene understanding model with feature distillation. Without any annotations or additional training, our model achieves promising annotation-free semantic segmentation results on open-vocabulary semantics and long-tailed concepts. In addition, serving as a cross-modal pre-training framework, our method can be used to improve data efficiency during fine-tuning. Our model outperforms previous SOTA methods on various zero-shot and data-efficient learning benchmarks. Most importantly, our model successfully inherits CLIP's richly structured knowledge, allowing 3D scene understanding models to recognize not only object concepts but also open-world semantics.
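The projection and distillation steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the dense per-pixel CLIP feature maps are already available, uses a pinhole camera with no occlusion test, averages features across views, and uses a cosine loss as the distillation objective. All function names (`project_points`, `lift_features`, `cosine_distill_loss`, `zero_shot_labels`) are hypothetical.

```python
import numpy as np


def normalize(x):
    """L2-normalize rows, guarding against zero vectors."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)


def project_points(points, K, world_to_cam, height, width):
    """Project N x 3 world points into one image (pinhole model, assumed)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by z <= 0
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / safe_z).astype(int)
    v = np.round(uv[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u, v, valid


def lift_features(points, views):
    """Average pixel features over every view in which a point projects
    inside the image. `views` is a list of (H x W x C feature map, 3 x 3
    intrinsics, 4 x 4 world-to-camera matrix). Occlusion is ignored here,
    which is a simplification."""
    channels = views[0][0].shape[2]
    acc = np.zeros((len(points), channels))
    count = np.zeros(len(points))
    for feat, K, world_to_cam in views:
        h, w, _ = feat.shape
        u, v, valid = project_points(points, K, world_to_cam, h, w)
        acc[valid] += feat[v[valid], u[valid]]
        count[valid] += 1
    seen = count > 0
    acc[seen] /= count[seen, None]
    return acc, seen


def cosine_distill_loss(pred, target, mask):
    """1 - mean cosine similarity over points that received a 2D target.
    The cosine form is an assumption made for this sketch."""
    if not mask.any():
        return 0.0
    cos = np.sum(normalize(pred[mask]) * normalize(target[mask]), axis=1)
    return float(1.0 - cos.mean())


def zero_shot_labels(point_feats, text_embeds):
    """Assign each point the class whose text embedding is most similar."""
    sims = normalize(point_feats) @ normalize(text_embeds).T
    return sims.argmax(axis=1)
```

In a training loop, `pred` would be the output of the 3D network for each point and `target` the lifted 2D features; at inference, `zero_shot_labels` would compare the 3D network's point features against CLIP text embeddings of arbitrary class names, which is what makes the segmentation annotation-free in this sketch.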

