Graph Representation Learning for Infrared and Visible Image Fusion

Infrared and visible image fusion aims to extract complementary features and synthesize them into a single fused image. Many methods employ convolutional neural networks (CNNs) to extract local features because of their translation invariance and locality. However, CNNs fail to consider the image's non-local self-similarity (NLss); although pooling operations can expand the receptive field, they inevitably cause information loss. In addition, transformer structures extract long-range dependencies by considering the correlation among all image patches, which introduces information redundancy into transformer-based methods. In contrast, graph representation is more flexible than grid (CNN) or sequence (transformer) representation for handling irregular objects, and a graph can also model relationships among spatially repeated details or textures that lie far apart. Therefore, to address the above issues, it is worthwhile to convert images into the graph space and adopt graph convolutional networks (GCNs) to extract NLss, because a graph provides a fine structure for aggregating features and propagating information across the nearest vertices without introducing redundant information. Concretely, we implement a cascaded NLss extraction pattern that extracts intra- and inter-modal NLss by exploring interactions among image pixels across intra- and inter-image positional distances. We begin by performing GCNs on each modality separately to aggregate features and propagate information, yielding independent intra-modal NLss. Then, GCNs are performed on the concatenated intra-modal NLss features of the infrared and visible images to explore the cross-domain, inter-modal NLss and reconstruct the fused image. Ablation studies and extensive experiments on three datasets illustrate the effectiveness and superiority of the proposed method.
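The cascaded pattern described above (intra-modal graph convolution on each image, followed by inter-modal graph convolution on the concatenated features) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' network: the k-nearest-neighbour graph construction, the mean-aggregation rule, the channel-wise concatenation, the random weights, and all function names (knn_adjacency, graph_conv, cascaded_nlss) are hypothetical choices made for the example.

```python
import numpy as np


def knn_adjacency(features, k):
    """Row-normalised adjacency linking each vertex to its k nearest
    neighbours in feature space, plus a self-loop (assumed graph rule)."""
    n = features.shape[0]
    sq = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances between vertex features.
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # exclude self from the neighbour search
    idx = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], idx] = 1.0
    adj += np.eye(n)  # self-loop keeps each vertex's own feature
    return adj / adj.sum(axis=1, keepdims=True)


def graph_conv(features, weight, k):
    """One graph convolution: average over neighbours, project, apply ReLU."""
    adj = knn_adjacency(features, k)
    return np.maximum(adj @ features @ weight, 0.0)


def cascaded_nlss(ir, vis, w_ir, w_vis, w_fuse, k=4):
    """Intra-modal graph convolution per image, then inter-modal graph
    convolution on the channel-wise concatenation (an assumption)."""
    h_ir = graph_conv(ir, w_ir, k)      # intra-modal features, infrared
    h_vis = graph_conv(vis, w_vis, k)   # intra-modal features, visible
    h_cat = np.concatenate([h_ir, h_vis], axis=1)
    return graph_conv(h_cat, w_fuse, k)  # inter-modal features


# Toy input: 16 vertices (e.g. image patches) with 8 channels per modality.
rng = np.random.default_rng(0)
ir = rng.standard_normal((16, 8))
vis = rng.standard_normal((16, 8))
w_ir = 0.1 * rng.standard_normal((8, 8))
w_vis = 0.1 * rng.standard_normal((8, 8))
w_fuse = 0.1 * rng.standard_normal((16, 8))
fused = cascaded_nlss(ir, vis, w_ir, w_vis, w_fuse, k=4)
print(fused.shape)
```

Because each vertex aggregates only from its k nearest neighbours in feature space, similar patches exchange information regardless of their spatial distance, while unrelated patches are left out; in a trained model the weights would be learned and a decoder would map the inter-modal features back to the fused image.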
