Infrared-visible image fusion (IVIF) has attracted considerable attention owing to the highly complementary properties of the two image modalities. Because ground-truth fused images are unavailable, the output of current deep-learning-based fusion methods depends heavily on mathematically defined loss functions. Since it is difficult to specify the desired fused image mathematically without ground truth, the performance of existing fusion methods is limited. In this paper, we propose, for the first time, to express the objective of IVIF in natural language, which avoids the explicit mathematical modeling of the fusion output required by current losses and exploits the expressiveness of language to improve fusion performance. To this end, we present a comprehensive language-expressed fusion objective and encode the relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space by establishing relationships among the embedded vectors that represent the fusion objective and the input image modalities. Finally, a language-driven loss is derived to align the actual IVIF with the embedded language-driven fusion model via supervised training. Experiments show that our method obtains substantially better fusion results than existing techniques.
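To make the idea of a loss defined in the embedding space more concrete, the following is a minimal sketch of one possible reading of the description above, not the formulation used in the paper. It assumes that the "relationship among the embedded vectors" is directional: the shift from each input image embedding to the fused image embedding should point the same way as the shift from the corresponding text embedding to the fusion-objective text embedding. The function names `unit`, `directional_loss`, and `language_driven_loss` are hypothetical, and random NumPy vectors stand in for real CLIP image and text embeddings.

```python
import numpy as np


def unit(v, eps=1e-8):
    """Scale vectors to (approximately) unit length along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def directional_loss(img_src, img_fused, txt_src, txt_fused):
    """1 - cosine similarity between the image shift and the text shift.

    ASSUMPTION: this directional form is illustrative only; the abstract
    does not specify how the embedded vectors are related.
    """
    d_img = unit(img_fused - img_src)   # shift: source image -> fused image
    d_txt = unit(txt_fused - txt_src)   # shift: source text  -> objective text
    cos = np.sum(d_img * d_txt, axis=-1)
    return float(np.mean(1.0 - cos))


def language_driven_loss(e_ir, e_vis, e_fused, t_ir, t_vis, t_fused):
    """Sum of the directional terms for the infrared and visible inputs."""
    return (directional_loss(e_ir, e_fused, t_ir, t_fused)
            + directional_loss(e_vis, e_fused, t_vis, t_fused))


# Placeholder embeddings (hypothetical): a batch of 4 images, dimension 512.
rng = np.random.default_rng(0)
dim = 512
e_ir, e_vis, e_fused = (rng.normal(size=(4, dim)) for _ in range(3))
t_ir, t_vis, t_fused = (rng.normal(size=dim) for _ in range(3))

loss = language_driven_loss(e_ir, e_vis, e_fused, t_ir, t_vis, t_fused)
```

In an actual training setup, the image embeddings would come from a CLIP image encoder applied to the infrared, visible, and fused images, the text embeddings from the CLIP text encoder applied to the prompts describing each modality and the fusion objective, and the loss would be written in a differentiable framework so that gradients reach the fusion network.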