Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, the major bottleneck of current 3D recognition approaches is that they cannot recognize novel classes beyond the training categories, as diverse real-world applications require. Meanwhile, current state-of-the-art 3D scene understanding approaches rely on high-quality labels to train neural networks and perform well only in a fully supervised manner. This work presents a generalized and simple framework for 3D scene understanding when labeled scenes are very limited. To extract knowledge about novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models and benefits open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel boundary-aware energy-based loss that benefits from region-level boundary predictions. To encourage latent instance discrimination while preserving efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All code, models, and data are publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
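To make the region-level contrastive idea concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes that points have already been grouped into regions, that the network outputs per-point softmax probabilities, and that a supervised-contrastive (InfoNCE-style) objective is applied to confident regions only. The function name `region_contrastive_loss`, the threshold `conf_thresh`, and the temperature `tau` are hypothetical choices made for illustration.

```python
import numpy as np

def region_contrastive_loss(feats, probs, region_ids, conf_thresh=0.9, tau=0.1):
    """Illustrative confidence-filtered region-level contrastive loss.

    feats:      (N, D) per-point feature embeddings
    probs:      (N, C) per-point softmax predictions
    region_ids: (N,)   integer region index of each point (assumed given)
    """
    eps = 1e-12
    regions = np.unique(region_ids)
    # Average-pool features and predictions inside each region.
    region_feats = np.stack([feats[region_ids == r].mean(axis=0) for r in regions])
    region_probs = np.stack([probs[region_ids == r].mean(axis=0) for r in regions])
    region_feats = region_feats / (np.linalg.norm(region_feats, axis=1, keepdims=True) + eps)

    # Keep only regions whose pooled prediction is confident; use it as a pseudo-label.
    keep = region_probs.max(axis=1) >= conf_thresh
    z, y = region_feats[keep], region_probs[keep].argmax(axis=1)
    n = len(y)
    if n < 2:
        return 0.0

    # Temperature-scaled cosine similarities, with self-pairs excluded.
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

    # Positives are other confident regions that share the same pseudo-label.
    pos = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    has_pos = pos.any(axis=1)
    if not has_pos.any():
        return 0.0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos.sum(axis=1)[has_pos]
    return float(per_anchor.mean())
```

In this sketch, regions with low-confidence predictions contribute nothing to the loss, which is one simple way to keep noisy pseudo-labels from driving the embedding. Applying the same function to intermediate features from several network stages and summing the results would mirror the multi-stage use described in the abstract, again as an assumption about how the pieces fit together.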
