LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. Popular LiDAR-only methods segment small and distant objects poorly because too few laser points fall on them, while robust multi-modal solutions remain under-explored. We investigate three inherent difficulties of the multi-modal setting: modality heterogeneity, the limited intersection of the sensors' fields of view, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of a geometry-based feature fusion phase (GF-Phase), cross-modal feature completion, and a semantic-based feature fusion phase (SF-Phase) on all visible points. Multi-modal data augmentation is revitalized by applying asymmetric transformations to the LiDAR point cloud and the multi-camera images individually, which lets model training benefit from more diversified augmentations. MSeg3D achieves state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets. With malfunctioning multi-camera input and with multi-frame point-cloud input, MSeg3D remains robust and still improves over the LiDAR-only baseline. Our code is publicly available at \url{https://github.com/jialeli1/lidarseg3d}.
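To make the geometry-based fusion and the limited field-of-view intersection concrete, the following is a minimal sketch of how LiDAR points can be projected into a camera image and paired with pixel features. It is an illustration under stated assumptions, not the MSeg3D implementation: the function names, the pinhole camera model, the nearest-neighbor sampling, and the zero-filling of points outside the image are all hypothetical choices made for this example.

```python
import numpy as np


def project_points_to_image(points_xyz, lidar_to_cam, intrinsics, image_hw):
    """Project LiDAR points into one camera (assumed pinhole model).

    points_xyz:   (N, 3) points in the LiDAR frame.
    lidar_to_cam: (4, 4) rigid transform from LiDAR to camera frame.
    intrinsics:   (3, 3) camera matrix.
    image_hw:     (height, width) of the image or feature map.
    Returns (uv, visible): (N, 2) pixel coordinates and an (N,) boolean mask.
    """
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (lidar_to_cam @ homo.T).T[:, :3]                        # camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out anyway.
    safe_depth = np.where(in_front, depth, 1.0)
    pix = (intrinsics @ cam.T).T
    uv = pix[:, :2] / safe_depth[:, None]
    h, w = image_hw
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] <= w - 1)
        & (uv[:, 1] >= 0) & (uv[:, 1] <= h - 1)
    )
    return uv, in_front & inside


def gather_point_image_features(image_feats, uv, visible):
    """Nearest-neighbor sampling of (H, W, C) features at projected points.

    Points outside the camera field of view get zeros here (an assumption
    of this sketch, standing in for a learned completion step).
    """
    c = image_feats.shape[2]
    out = np.zeros((uv.shape[0], c), dtype=image_feats.dtype)
    cols = np.rint(uv[visible, 0]).astype(int)
    rows = np.rint(uv[visible, 1]).astype(int)
    out[visible] = image_feats[rows, cols]
    return out
```

The `visible` mask is where the field-of-view limitation shows up: only points that land inside some image have a camera feature to fuse. The abstract's cross-modal feature completion targets the remaining points; this sketch simply leaves their features at zero.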