3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality. However, the task is extremely challenging due to ambiguities arising from the unstructured, sparse, and uncolored nature of 3D point clouds. A possible solution is to combine the 3D information with data from sensors of a different modality, such as RGB cameras. Recent multi-modal 3D semantic segmentation networks exploit these modalities with two branches that process the 2D and 3D information independently, striving to preserve the strengths of each modality. In this work, we first explain why this design choice is effective and then show how it can be improved to make multi-modal semantic segmentation more robust to domain shift. Our surprisingly simple contribution achieves state-of-the-art performance on four popular multi-modal unsupervised domain adaptation benchmarks, as well as better results in a domain generalization scenario.