Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging because each modality has unique characteristics. In this paper, we propose a novel fusion strategy that can effectively fuse information from different combinations of four modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP), and Near-Infrared (NIR). We also propose a new model, the Multi-Modal Segmentation Transformer (MMSFormer), which incorporates the proposed fusion strategy to perform multimodal material segmentation. MMSFormer achieves 52.05% mIoU, outperforming the current state of the art on the Multimodal Material Segmentation (MCubeS) dataset. In particular, our method significantly improves detection of the gravel (+10.4%) and human (+9.1%) classes. Ablation studies show that the different modules in the fusion block are crucial for overall model performance. They also highlight the capacity of different input modalities to improve the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.
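The headline result is reported as mean Intersection over Union (mIoU). For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed from flattened per-pixel class labels. The function name `mean_iou`, the choice to skip classes that appear in neither the prediction nor the ground truth, and the toy labels are assumptions made for illustration; this is not taken from the MMSFormer code, and the exact evaluation protocol used on MCubeS may differ (for example, in how counts are accumulated over the test set).

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over flattened label sequences.

    Assumption for this sketch: classes absent from both `pred` and
    `target` are skipped rather than counted as IoU = 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where pred == target == c
    pred_count = [0] * num_classes    # pixels predicted as class c
    target_count = [0] * num_classes  # pixels labeled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes (hypothetical labels, not MCubeS data).
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. A per-class gain such as the reported +10.4% for gravel corresponds to a change in one of the per-class IoU terms that are averaged here.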