Existing text-driven infrared and visible image fusion approaches often rely on sentence-level textual information, which can introduce semantic noise from redundant text and fails to fully exploit the deeper semantic value of the text. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach comprises three key innovations: (i) a principled method for extracting entity-level textual information from image captions generated by large vision-language models, which eliminates semantic noise in the raw text while preserving critical semantic information; (ii) a parallel multi-task learning architecture that integrates image fusion with a multi-label classification task: using entities as pseudo-labels, the classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; and (iii) an entity-guided cross-modal interactive module that facilitates fine-grained interaction between visual and entity-level textual features, enhancing feature representation by capturing cross-modal dependencies at both the inter-visual and visual-entity levels. To promote wide adoption of the entity-guided image fusion framework, we release entity-annotated versions of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT outperforms state-of-the-art methods in preserving salient targets, texture details, and semantic consistency. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.
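The abstract does not specify how entity-level information is distilled from captions into pseudo-labels for the multi-label classification branch. The following is a minimal, hypothetical sketch of one plausible pipeline, assuming a spaCy-style noun-phrase pass over a VLM-generated caption and a fixed entity vocabulary (`ENTITY_VOCAB` and `caption_to_multihot` are illustrative names, not from the paper).

```python
# Hypothetical sketch: turning a generated caption into entity-level pseudo-labels.
# The paper's exact extraction method is not given here; this assumes a spaCy
# noun-chunk pass followed by filtering against a dataset-level entity vocabulary.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (assumed installed)

# Illustrative entity vocabulary shared across the dataset.
ENTITY_VOCAB = ["person", "car", "tree", "building", "road", "bicycle"]

def caption_to_multihot(caption: str) -> list[int]:
    """Map a caption to a multi-hot pseudo-label vector over ENTITY_VOCAB."""
    doc = nlp(caption.lower())
    # Keep noun-chunk head lemmas as candidate entities, dropping stop words.
    entities = {chunk.root.lemma_ for chunk in doc.noun_chunks
                if not chunk.root.is_stop}
    return [1 if e in entities else 0 for e in ENTITY_VOCAB]

print(caption_to_multihot("A person rides a bicycle past parked cars on the road."))
# -> [1, 1, 0, 0, 1, 1]
```

These multi-hot vectors can then supervise the multi-label classification head alongside the fusion objective.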
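Likewise, the entity-guided cross-modal interactive module is only described at a high level. Below is a minimal sketch, not the authors' implementation, of one way to capture the two dependency levels mentioned in the abstract: self-attention over visual tokens (inter-visual) followed by cross-attention to entity embeddings (visual-entity). Class name, dimensions, and layer choices are assumptions for illustration.

```python
# Minimal sketch of an entity-guided cross-modal interaction block (assumed design).
import torch
import torch.nn as nn

class EntityGuidedInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.visual_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.entity_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_tokens, entity_tokens):
        # visual_tokens: (B, N_v, dim) flattened visual features
        # entity_tokens: (B, N_e, dim) embeddings of extracted entities
        v, _ = self.visual_self_attn(visual_tokens, visual_tokens, visual_tokens)
        visual_tokens = self.norm1(visual_tokens + v)           # inter-visual level
        v, _ = self.entity_cross_attn(visual_tokens, entity_tokens, entity_tokens)
        return self.norm2(visual_tokens + v)                    # visual-entity level

# Toy usage with random features.
block = EntityGuidedInteraction()
fused = block(torch.randn(2, 1024, 256), torch.randn(2, 6, 256))
print(fused.shape)  # torch.Size([2, 1024, 256])
```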