Multi-modality image fusion combines information from different sensors or modalities so that the fused image retains the complementary features of each modality, such as functional highlights and texture details. However, training such fusion models effectively is difficult because no ground-truth fused images exist. To address this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach builds on the prior knowledge that natural images are equivariant to specific transformations. We therefore introduce a novel training framework comprising a fusion module and a learnable pseudo-sensing module, so that network training follows the principles of the physical sensing and imaging process while also satisfying the equivariance prior for natural images. Extensive experiments show that our method produces high-quality fusion results for both infrared-visible and medical images and facilitates downstream multi-modal segmentation and detection tasks. The code will be released.
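To make the training idea above concrete, the following is a minimal NumPy sketch of how a sensing-consistency term and an equivariance term could be combined without ground-truth fused images. It is an illustration under stated assumptions, not the released EMMA code: the abstract does not specify the loss, so the two terms, the L1 distance, and the choice of transformation are assumptions, and `fuse`, `sense_ir`, `sense_vis`, `transform`, and `emma_style_loss` are hypothetical names with toy stand-ins where the paper would use learned networks.

```python
import numpy as np


def fuse(ir, vis):
    """Toy stand-in for the fusion module (assumption): pixel-wise average."""
    return 0.5 * (ir + vis)


def sense_ir(fused, gain=1.0):
    """Toy stand-in for a learnable pseudo-sensing module mapping the
    fused image back to the infrared modality (assumption: a scalar gain)."""
    return gain * fused


def sense_vis(fused, gain=1.0):
    """Toy stand-in for the pseudo-sensing module of the visible modality."""
    return gain * fused


def transform(img, k=1, shift=3):
    """Example transformation (assumption): rotate by k * 90 degrees,
    then shift circularly along the second axis."""
    return np.roll(np.rot90(img, k), shift, axis=1)


def emma_style_loss(ir, vis):
    """Return (sensing_term, equivariance_term) for one image pair.

    sensing_term: re-sensing the fused image should reproduce each input.
    equivariance_term: transforming the fused image, re-sensing it, and
    fusing again should give back the transformed fused image.
    """
    fused = fuse(ir, vis)

    # Sensing consistency: pseudo-sensed images vs. the real source images.
    sensing_term = (
        np.mean(np.abs(sense_ir(fused) - ir))
        + np.mean(np.abs(sense_vis(fused) - vis))
    )

    # Equivariance: fuse(sense(T(fused))) should match T(fused).
    fused_t = transform(fused)
    refused = fuse(sense_ir(fused_t), sense_vis(fused_t))
    equivariance_term = np.mean(np.abs(refused - fused_t))

    return sensing_term, equivariance_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir = rng.random((8, 8))   # synthetic "infrared" image
    vis = rng.random((8, 8))  # synthetic "visible" image
    sensing, equivariance = emma_style_loss(ir, vis)
    print("sensing term:", sensing)
    print("equivariance term:", equivariance)
```

With these toy stand-ins the equivariance term is trivially zero, because averaging two identical re-sensed images returns the input unchanged; with learned modules both terms would generally be nonzero and would be minimized jointly during training.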