LiDAR Semantic Segmentation is a fundamental task in autonomous driving perception, which consists in associating a semantic label with each LiDAR point. Fully supervised models have widely tackled this task, but they require labels for each scan, which either limits their domain or requires impractical amounts of expensive annotations. Camera images, which are generally recorded alongside LiDAR point clouds, can be processed by widely available 2D foundation models, which are generic and dataset-agnostic. However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges. For example, the classical perspective projection suffers from the parallax effect produced by the positional shift between the two sensors at their respective capture times. We propose a Semi-Supervised Learning setup to leverage unlabeled LiDAR point clouds alongside distilled knowledge from the camera images. To self-supervise our model on the unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features. The NeRF head predicts densities and semantic logits at each sampled ray location, which are used for rendering pixel semantics. Concurrently, we query the Segment-Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks. We fuse the masks with the rendered pixel semantics from LiDAR to produce pseudo-labels that supervise the pixel predictions. During inference, we drop the NeRF head and run our model with only LiDAR. We show the effectiveness of our approach on three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI, and ScribbleKITTI.
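The rendering step described above (per-sample densities and semantic logits composited into a pixel prediction) follows standard NeRF-style alpha compositing along a ray. Below is a minimal NumPy sketch of that compositing; the array shapes, the softmax-before-blending choice, and the function name are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def render_ray_semantics(sigmas, logits, deltas):
    """Alpha-composite per-sample semantic logits along one camera ray.

    sigmas: (N,) predicted volume densities at the N ray samples
    logits: (N, C) per-sample semantic logits for C classes
    deltas: (N,) distances between consecutive ray samples
    Returns (C,) rendered class probabilities for the pixel.
    """
    # Opacity of each sample from its density and interval length.
    alphas = 1.0 - np.exp(-sigmas * deltas)                      # (N,)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]
    weights = alphas * trans                                     # (N,)
    # Softmax the logits per sample, then blend by rendering weights
    # (one plausible choice; compositing raw logits also appears in practice).
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                    # (N, C)
    return weights @ probs                                       # (C,)
```

A sample with high density close to the camera dominates the rendered pixel, so the pixel-level supervision signal is routed back to the voxel features that generated that sample.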
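The pseudo-labeling step fuses SAM's class-agnostic masks with the pixel semantics rendered from LiDAR. One plausible fusion rule is a majority vote of the rendered classes inside each mask; this vote rule and the `min_votes` parameter are illustrative assumptions, not taken from the abstract:

```python
import numpy as np

def fuse_masks_with_semantics(pred_classes, masks, min_votes=1):
    """Build pseudo-labels by majority-voting rendered semantics inside SAM masks.

    pred_classes: (H, W) int array of per-pixel classes rendered from LiDAR
                  (-1 where no ray/point provides a prediction)
    masks: list of (H, W) boolean arrays, class-agnostic masks from SAM
    Returns an (H, W) pseudo-label map, -1 where no label is assigned.
    """
    pseudo = np.full(pred_classes.shape, -1, dtype=np.int64)
    for mask in masks:
        # Collect valid rendered predictions falling inside this mask.
        votes = pred_classes[mask & (pred_classes >= 0)]
        if votes.size >= min_votes:
            # Spread the dominant class over the whole mask, densifying
            # the sparse LiDAR-rendered semantics to every mask pixel.
            pseudo[mask] = np.bincount(votes).argmax()
    return pseudo
```

This densification is where the generic masks add value: a handful of confidently rendered pixels can label an entire object region delineated by SAM.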