As a task within image fusion, Infrared and Visible Image Fusion aims to integrate the complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of the SSSM for capturing the global spatial information of both modalities. This limitation prevents the global spatial information from both modalities from being considered simultaneously during cross-modal interaction, resulting in an incomplete perception of salient targets. Consequently, the fusion results tend to be biased towards one modality rather than adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In S4Fusion, the proposed Cross-Modal Spatial Awareness Module (CMSA) simultaneously attends to the global spatial information of both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both source images. Extensive experiments demonstrate that our approach produces high-quality fused images and enhances performance on downstream tasks.
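To make the uncertainty-minimization idea concrete, the following is a minimal sketch, assuming the uncertainty is measured as the prediction entropy of a frozen pre-trained network evaluated on the fused image; the function name `saliency_uncertainty_loss` and this exact formulation are illustrative assumptions, not the paper's definitive loss.

```python
import torch
import torch.nn.functional as F

def saliency_uncertainty_loss(fused_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: Shannon entropy of a frozen pre-trained network's
    predictions on the fused image. Minimizing it pushes the fusion network
    toward outputs on which the perception network responds confidently,
    i.e., toward preserving salient targets from both modalities.
    The actual S4Fusion objective may differ."""
    probs = F.softmax(fused_logits, dim=1)                       # class probabilities per pixel / image
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)      # per-location entropy
    return entropy.mean()

# Usage sketch: `pretrained_net` is frozen; gradients flow only into the fusion model.
# fused = fusion_model(infrared, visible)
# loss = saliency_uncertainty_loss(pretrained_net(fused))
# loss.backward()
```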