Using multiple spatial modalities has proven helpful in improving semantic segmentation performance. However, several real-world challenges have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at test time. To address these challenges, we first propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion, that performs better than state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves multi-modal performance but also makes the model robust to realistic missing-modality scenarios by using unlabeled data. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines. Our code is available at https://github.com/harshm121/M3L