3D semantic segmentation is essential for autonomous driving and road infrastructure analysis, but state-of-the-art 3D models suffer from severe domain shift when applied across datasets. We propose a multi-view projection framework for unsupervised domain adaptation (UDA). Our method aligns LiDAR scans into coherent 3D scenes and renders them from multiple virtual camera poses to generate large-scale synthetic 2D datasets (PC2D) in various modalities. An ensemble of 2D segmentation models is trained on these modalities, and during inference, hundreds of views per scene are processed, with logits back-projected to 3D using an occlusion-aware voting scheme to produce point-wise labels. These labels can be used directly or to fine-tune a 3D segmentation model in the target domain. We evaluate our approach in both Real-to-Real and Simulation-to-Real UDA, achieving state-of-the-art performance in the Real-to-Real setting. Furthermore, we show that our framework enables segmentation of rare classes, leveraging only 2D annotations for those classes while relying on 3D annotations for others in the source domain.
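To make the back-projection step concrete, below is a minimal sketch (not the authors' released code) of how per-view 2D logits could be accumulated onto 3D points with a simple z-buffer occlusion test and voting. All function names, parameters, and the choice of a plain z-buffer are illustrative assumptions; the paper's occlusion-aware voting scheme may differ in detail.

```python
# Hypothetical sketch of occlusion-aware logit voting; names and tolerances are assumptions.
import numpy as np

def backproject_votes(points, poses, logit_maps, K, num_classes, z_tol=0.05):
    """points: (N, 3) scene points in world coordinates.
    poses: list of 4x4 world-to-camera matrices (one per rendered view).
    logit_maps: list of (H, W, C) logit maps from the 2D segmentation ensemble.
    K: 3x3 pinhole intrinsics of the virtual camera.
    Returns point-wise pseudo-labels obtained by argmax over accumulated logits."""
    N = points.shape[0]
    votes = np.zeros((N, num_classes), dtype=np.float32)
    pts_h = np.concatenate([points, np.ones((N, 1))], axis=1)  # homogeneous coords

    for T_wc, logits in zip(poses, logit_maps):
        H, W, _ = logits.shape
        cam = (T_wc @ pts_h.T).T[:, :3]                 # points in camera frame
        z = np.maximum(cam[:, 2], 1e-6)                 # avoid division by zero
        u = np.round(K[0, 0] * cam[:, 0] / z + K[0, 2])
        v = np.round(K[1, 1] * cam[:, 1] / z + K[1, 2])
        valid = (cam[:, 2] > 0.1) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u = u.astype(np.int64)
        v = v.astype(np.int64)
        idx = np.flatnonzero(valid)

        # Occlusion test: keep only points near the closest depth seen at each pixel.
        zbuf = np.full((H, W), np.inf, dtype=np.float32)
        np.minimum.at(zbuf, (v[idx], u[idx]), cam[idx, 2])
        vis = idx[cam[idx, 2] <= zbuf[v[idx], u[idx]] + z_tol]

        # Vote: accumulate the 2D logits of visible points across all views.
        votes[vis] += logits[v[vis], u[vis]]

    return votes.argmax(axis=1)
```

In this reading, each rendered view contributes soft votes only for the points it actually sees, so repeated observations across hundreds of views average out per-view errors before the final argmax produces the point-wise labels used directly or for fine-tuning the 3D model.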