This paper addresses a highly practical scenario: how to retain the benefits of multi-modal image fusion under harsh conditions when only a visible imaging sensor is available. To this end, we propose the novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a single-image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch, both built on diffusion models. They mine scene information obscured in the visible spectrum and learn the thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on these two branches, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from their diffusion streams, from which a cross-spectral scene representation is obtained through successive sampling. We then impose both visual and semantic constraints to ensure that this representation satisfies human observation while supporting downstream semantic decision-making. Extensive experiments show that MagicFuse achieves visual and semantic representation performance comparable to, or even better than, that of state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
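The core mechanism described above, combining the probabilistic noise predicted by two diffusion branches at each reverse step, can be sketched as a toy DDPM-style sampler. This is a minimal illustration, not the authors' implementation: the two stand-in noise predictors, the linear beta schedule, and the simple convex-combination fusion are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 2e-2, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_intra(x_t, t):
    # stand-in for the intra-spectral reinforcement branch's noise network
    return 0.1 * x_t

def eps_cross(x_t, t):
    # stand-in for the cross-spectral generation branch's noise network
    return -0.05 * x_t

def fuse_noise(e_intra, e_cross, w=0.5):
    # illustrative multi-domain fusion: convex combination of the
    # two branches' noise streams (the actual fusion branch is learned)
    return w * e_intra + (1.0 - w) * e_cross

def reverse_step(x_t, t):
    # one DDPM reverse step driven by the fused noise estimate
    eps = fuse_noise(eps_intra(x_t, t), eps_cross(x_t, t))
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps) / np.sqrt(alphas[t])
    if t > 0:  # no noise injected at the final step
        mean += np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

# successive sampling: start from pure noise, denoise step by step
x = rng.standard_normal((8, 8))
for t in reversed(range(T)):
    x = reverse_step(x, t)
print(x.shape)  # (8, 8)
```

In the paper's setting, the sampled `x` would be the cross-spectral scene representation, on which the visual and semantic constraints are imposed during training.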