Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend either to over-prioritize infrared intensity at the cost of visible detail or, conversely, to preserve visible structure while diminishing the salience of thermal targets. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both the channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.
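To make the pipeline described above concrete, the following minimal PyTorch sketch illustrates the difference-driven gating followed by channel- and spatial-dimension fusion. Everything here is an illustrative assumption rather than the authors' implementation: the module name DiffGuidedExchange, the gating design, and in particular the use of plain convolutions as stand-ins for the cross-attention dual state space and cross-modal scanning (Mamba-style) blocks named in the abstract.

```python
import torch
import torch.nn as nn


class DiffGuidedExchange(nn.Module):
    """Hypothetical sketch of the difference-driven channel/spatial fusion flow.

    The state space (selective-scan) blocks of the actual model are replaced
    here by ordinary convolutions so the sketch stays self-contained.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        # Feature discrepancy map -> per-channel gate (assumed design)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # Stand-in for the channel-exchange module (adaptive channel reweighting)
        self.channel_mix = nn.Conv2d(2 * channels, channels, 1)
        # Stand-in for the spatial-exchange module (cross-modal spatial scanning)
        self.spatial_mix = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        # 1) Discrepancy map between modalities guides feature extraction
        gate = self.gate((f_ir - f_vis).abs())
        f_ir = f_ir + f_ir * gate           # emphasize thermally salient regions
        f_vis = f_vis + f_vis * (1 - gate)  # emphasize visible-detail regions
        # 2) Channel-dimension exchange: merge and reweight channel-wise
        fused = self.channel_mix(torch.cat([f_ir, f_vis], dim=1))
        # 3) Spatial-dimension exchange: in the paper this is a cross-modal
        #    state space scan; a 3x3 conv keeps the sketch runnable
        return self.spatial_mix(fused)


if __name__ == "__main__":
    ir = torch.randn(1, 64, 32, 32)
    vis = torch.randn(1, 64, 32, 32)
    print(DiffGuidedExchange()(ir, vis).shape)  # torch.Size([1, 64, 32, 32])
```

In the actual model, the spatial-exchange step would presumably replace the 3x3 convolution with a selective-scan layer whose scan sequence interleaves tokens from both modalities, which is consistent with the linear-complexity global dependency modeling claimed in the abstract.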