Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions such as low-light or overexposed environments. Leveraging additional modalities (X-modality) such as thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable segmentation. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation built on the Selective Structured State Space Model, Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive field coverage with linear complexity. By employing a Siamese encoder and introducing a Mamba fusion mechanism, we effectively select essential information from the different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Sigma is rigorously evaluated on both RGB-Thermal and RGB-Depth segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
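To make the ideas of a linear-complexity selective scan, a Siamese (weight-sharing) encoder, and cross-modal fusion more concrete, the following is a minimal toy sketch in pure Python. It is not the Sigma implementation: the function names `selective_scan`, `siamese_encode`, and `fuse`, the scalar hidden state, the parameters `w_delta` and `a`, and the sigmoid gate are all assumptions chosen for illustration. The real model operates on 2D image features with learned, multi-channel parameters.

```python
import math


def selective_scan(xs, w_delta=1.0, a=-1.0):
    """Toy 1-D selective state-space scan with a scalar hidden state.

    Recurrence (hypothetical, simplified):
        delta_t = softplus(w_delta * x_t)              # input-dependent step size
        h_t     = exp(delta_t * a) * h_{t-1} + delta_t * x_t
        y_t     = h_t
    One loop over the sequence, so the cost grows linearly with its length,
    while h_t still carries information from every earlier token.
    Intended for small inputs; math.exp overflows for very large values.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus, always positive
        h = math.exp(delta * a) * h + delta * x    # decay old state, add new input
        ys.append(h)
    return ys


def siamese_encode(rgb_tokens, x_tokens, **params):
    """Encode both modalities with the same function and the same parameters."""
    return selective_scan(rgb_tokens, **params), selective_scan(x_tokens, **params)


def fuse(feat_rgb, feat_x):
    """Toy fusion (an assumption, not Sigma's Mamba fusion): a per-token
    sigmoid gate forms a convex combination of the two feature streams."""
    fused = []
    for r, x in zip(feat_rgb, feat_x):
        gate = 1.0 / (1.0 + math.exp(-(r - x)))
        fused.append(gate * r + (1.0 - gate) * x)
    return fused


if __name__ == "__main__":
    rgb = [0.2, 0.9, 0.1, 0.4]       # made-up RGB token values
    thermal = [0.7, 0.3, 0.8, 0.5]   # made-up X-modality token values
    f_rgb, f_x = siamese_encode(rgb, thermal)
    print(fuse(f_rgb, f_x))
```

The single loop in `selective_scan` is what contrasts with self-attention, where every token is compared against every other token and the cost grows quadratically with sequence length.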