FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

Multimodal image fusion aims to combine information from different modalities into a single image with comprehensive information and detailed textures. However, fusion models based on convolutional neural networks are limited in capturing global image features because convolution operates locally. Transformer-based models excel at global feature modeling but face computational challenges stemming from their quadratic complexity. Recently, the Selective Structured State Space Model has shown significant potential for long-range dependency modeling with linear complexity, offering a promising way to resolve this dilemma. In this paper, we propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba. Specifically, we devise an improved, efficient Mamba model for image fusion that integrates an efficient visual state space model with dynamic convolution and channel attention. This refined model not only preserves Mamba's performance and global modeling capability but also reduces channel redundancy while strengthening local feature enhancement. Additionally, we devise a dynamic feature fusion module (DFFM) comprising two dynamic feature enhancement modules (DFEM) and a cross-modality fusion Mamba module (CMFM). The former performs dynamic texture enhancement and dynamic difference perception, whereas the latter enhances correlated features between modalities and suppresses redundant intermodal information. FusionMamba achieves state-of-the-art (SOTA) performance across various multimodal medical image fusion tasks (CT-MRI, PET-MRI, SPECT-MRI), the infrared and visible image fusion task (IR-VIS), and a multimodal biomedical image fusion dataset (GFP-PC), which demonstrates the generalization ability of our model. The code for FusionMamba is available at https://github.com/millieXie/FusionMamba.
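
To make the linear-complexity claim concrete, the following is a minimal sketch of a selective state space recurrence on a 1-D sequence. It is a toy illustration under simplifying assumptions (scalar state, scalar input, a simplified discretization), not the FusionMamba implementation. The function name `selective_scan` and the parameters `w_delta`, `w_b`, `w_c`, and `a` are hypothetical and chosen only for this example.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0, a=-1.0):
    """Toy selective state space scan over a 1-D sequence.

    Assumptions (illustrative only): scalar hidden state, and the step
    size, input gate B and output gate C are simple linear functions of
    the current input. "Selective" means these depend on the input x_t.
    """
    h = 0.0
    ys = []
    for x in xs:                        # one pass over the sequence: O(L)
        delta = softplus(w_delta * x)   # input-dependent step size
        a_bar = math.exp(delta * a)     # discretized decay, in (0, 1) when a < 0
        b_bar = delta * (w_b * x)       # simplified discretization of B(x)
        h = a_bar * h + b_bar * x       # state update
        ys.append((w_c * x) * h)        # readout with input-dependent C(x)
    return ys


if __name__ == "__main__":
    print(selective_scan([0.2, 0.5, 0.1, 0.9]))
```

Each step touches only the previous state and the current input, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions and grows quadratically.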
