We propose SAL (Segment Anything in Lidar), a method consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision ``for free''. Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. By training our model on these labels, we distill the 2D foundation models into our Lidar SAL model. Even without manual labels, our model achieves $91\%$ of the fully supervised state-of-the-art in class-agnostic segmentation and $54\%$ in zero-shot Lidar Panoptic Segmentation. Furthermore, we outperform several baselines that do not distill but only lift image features to 3D. More importantly, we demonstrate that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with increasing amounts of self-labeled data. Code and models are available at this $\href{https://github.com/nv-dvl/segment-anything-lidar}{URL}$.
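The lifting step described above can be sketched with a minimal projection routine: each lidar point is transformed into the camera frame, projected with the intrinsics, and assigned the instance ID (and CLIP token) of the 2D mask it lands in. This is an illustrative sketch only, not the authors' implementation; the function name, argument layout, and the simple nearest-pixel assignment are assumptions, and real pipelines must additionally handle occlusion and per-camera coverage.

```python
import numpy as np

def lift_masks_to_lidar(points, masks, clip_tokens, K, T_cam_from_lidar, img_hw):
    """Assign 2D instance masks and their CLIP tokens to lidar points.

    points:            (N, 3) lidar points in the lidar frame
    masks:             (M, H, W) boolean instance masks from a 2D model
    clip_tokens:       (M, D) one CLIP token per instance mask
    K:                 (3, 3) camera intrinsics
    T_cam_from_lidar:  (4, 4) extrinsic calibration (lidar -> camera)
    img_hw:            (H, W) image size

    Returns per-point instance labels (-1 = unassigned) and per-point tokens.
    """
    H, W = img_hw
    n = len(points)

    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1  # keep points in front of the camera

    # Perspective projection onto the image plane.
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    labels = np.full(n, -1, dtype=int)
    point_tokens = np.zeros((n, clip_tokens.shape[1]))

    idx = np.where(valid)[0]
    uu, vv = u[valid], v[valid]
    for i, mask in enumerate(masks):
        hit = idx[mask[vv, uu]]  # points whose pixel lies inside mask i
        labels[hit] = i
        point_tokens[hit] = clip_tokens[i]
    return labels, point_tokens
```

A model trained on such (point, instance label, CLIP token) triplets can then be prompted with text at inference time by comparing predicted tokens against CLIP-encoded class names, which is what enables the zero-shot classification described in the abstract.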