3D panoptic segmentation is a challenging perception task that aims to predict both semantic and instance annotations for the 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved strong performance on closed-set benchmarks, generalizing to novel categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation models have achieved promising results by relying solely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not achieve good performance, owing to poor per-mask classification quality on novel categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model fuses learnable LiDAR features with dense features from a frozen CLIP vision encoder and uses a single classification head to make predictions for both base and novel classes. To further improve classification performance on novel classes and to better leverage the CLIP model, we propose two novel loss functions: an object-level distillation loss and a voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms strong baselines by a large margin.
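The abstract does not specify the form of the classification head or of the distillation losses. As an illustration only, the sketch below assumes a common open-vocabulary formulation: a query (mask) embedding is scored against one CLIP text embedding per class by cosine similarity, so that base and novel classes share a single head, and a distillation term penalizes the cosine distance between a predicted embedding and its CLIP target. The function names, the temperature value, and the toy 3-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(query_embedding, text_embeddings, temperature=0.07):
    """Softmax over cosine similarities to every class text embedding.

    Base and novel classes are handled identically: each class is just
    one more row in text_embeddings (assumed formulation).
    """
    logits = [cosine(query_embedding, t) / temperature for t in text_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(predicted_embedding, clip_embedding):
    """Assumed distillation term: 1 - cosine similarity to the CLIP target."""
    return 1.0 - cosine(predicted_embedding, clip_embedding)


# Toy stand-ins for CLIP text embeddings of three classes (hypothetical values).
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9, 0.2]

probs = classify(query, text_embeddings)
predicted_class = probs.index(max(probs))
loss = distillation_loss(query, text_embeddings[predicted_class])
```

Under these assumptions, the same loss function could be applied per object (one embedding per predicted mask) or per voxel (one embedding per voxel), which is one plausible reading of the two granularities named in the abstract.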