Existing text-driven infrared and visible image fusion approaches often rely on sentence-level textual information, which can introduce semantic noise from redundant text and fails to fully exploit the deeper semantic value of the text. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach comprises three key innovations: (i) a principled method for extracting entity-level textual information from image captions generated by large vision-language models, which eliminates semantic noise in the raw text while preserving critical semantic information; (ii) a parallel multi-task learning architecture that integrates image fusion with a multi-label classification task: using entities as pseudo-labels, the classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; and (iii) an entity-guided cross-modal interactive module that facilitates fine-grained interaction between visual and entity-level textual features, enhancing feature representation by capturing cross-modal dependencies at both the inter-visual and visual-entity levels. To promote wide adoption of the entity-guided image fusion framework, we release entity-annotated versions of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT outperforms state-of-the-art methods in preserving salient targets, texture details, and semantic consistency. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.
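The abstract does not specify how entity-level information is distilled from captions into pseudo-labels for the multi-label classification branch. The following is a minimal, hypothetical sketch of one plausible pipeline, assuming a spaCy-style noun-phrase pass over a VLM-generated caption and a fixed entity vocabulary (`ENTITY_VOCAB` and `caption_to_multihot` are illustrative names, not from the paper).

```python
# Hypothetical sketch: turning a generated caption into entity-level pseudo-labels.
# The paper's exact extraction method is not given here; this assumes a spaCy
# noun-chunk pass followed by filtering against a dataset-level entity vocabulary.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (assumed installed)

# Illustrative entity vocabulary shared across the dataset.
ENTITY_VOCAB = ["person", "car", "tree", "building", "road", "bicycle"]

def caption_to_multihot(caption: str) -> list[int]:
    """Map a caption to a multi-hot pseudo-label vector over ENTITY_VOCAB."""
    doc = nlp(caption.lower())
    # Keep noun-chunk head lemmas as candidate entities, dropping stop words.
    entities = {chunk.root.lemma_ for chunk in doc.noun_chunks
                if not chunk.root.is_stop}
    return [1 if e in entities else 0 for e in ENTITY_VOCAB]

print(caption_to_multihot("A person rides a bicycle past parked cars on the road."))
# -> [1, 1, 0, 0, 1, 1]
```

These multi-hot vectors can then supervise the multi-label classification head alongside the fusion objective.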
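Likewise, the entity-guided cross-modal interactive module is only described at a high level. Below is a minimal sketch, not the authors' implementation, of one way to capture the two dependency levels mentioned in the abstract: self-attention over visual tokens (inter-visual) followed by cross-attention to entity embeddings (visual-entity). Class name, dimensions, and layer choices are assumptions for illustration.

```python
# Minimal sketch of an entity-guided cross-modal interaction block (assumed design).
import torch
import torch.nn as nn

class EntityGuidedInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.visual_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.entity_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_tokens, entity_tokens):
        # visual_tokens: (B, N_v, dim) flattened visual features
        # entity_tokens: (B, N_e, dim) embeddings of extracted entities
        v, _ = self.visual_self_attn(visual_tokens, visual_tokens, visual_tokens)
        visual_tokens = self.norm1(visual_tokens + v)           # inter-visual level
        v, _ = self.entity_cross_attn(visual_tokens, entity_tokens, entity_tokens)
        return self.norm2(visual_tokens + v)                    # visual-entity level

# Toy usage with random features.
block = EntityGuidedInteraction()
fused = block(torch.randn(2, 1024, 256), torch.randn(2, 6, 256))
print(fused.shape)  # torch.Size([2, 1024, 256])
```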