Zero-shot point cloud segmentation aims to enable deep models to recognize novel objects in point clouds that were unseen during training. Recent methods favor a pipeline that transfers knowledge from labeled seen classes to unlabeled unseen classes. They typically align visual features with semantic features obtained from word embeddings, supervised by the annotations of the seen classes. However, point clouds contain too little information to fully match the semantic features. The rich appearance information in images is a natural complement to textureless point clouds, yet it has not been well explored in previous literature. Motivated by this, we propose a novel multi-modal zero-shot learning method that better exploits the complementary information of point clouds and images for more accurate visual-semantic alignment. Extensive experiments on two popular benchmarks, SemanticKITTI and nuScenes, show that our method outperforms current state-of-the-art methods, with average improvements of 52% and 49% in unseen-class mIoU, respectively.