Advanced image fusion methods generate fused results by aggregating the complementary information conveyed by the source images. However, because each source manifests the imaged scene content differently, it is difficult to design a robust and controllable fusion process. We argue that this issue can be alleviated with the help of higher-level semantics conveyed by the text modality, which should enable us to generate fused images for different purposes, such as visualisation and downstream tasks, in a controllable way. We achieve this by exploiting a vision-and-language model to build a coarse-to-fine association mechanism between the text and image signals. Guided by the resulting association maps, an affine fusion unit embedded in the transformer network fuses the text and vision modalities at the feature level. As another ingredient of this work, we propose the use of textual attention to adapt image quality assessment to the fusion task. To facilitate the implementation of the proposed text-guided fusion paradigm and its adoption by the wider research community, we release IVT, a text-annotated image fusion dataset. Extensive experiments demonstrate that our approach (TextFusion) consistently outperforms traditional appearance-based fusion methods. Our code and dataset will be publicly available at https://github.com/AWCXV/TextFusion.
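To make the idea of an association-guided affine fusion unit more concrete, the following is a minimal NumPy sketch and not the TextFusion implementation. It assumes a FiLM-style modulation in which the text embedding is linearly projected to a per-channel scale and shift, and the association map gates where that modulation is applied. The function name `affine_fusion`, the projection matrices `w_scale` and `w_shift`, and all tensor shapes are hypothetical choices made for illustration only.

```python
import numpy as np


def affine_fusion(image_feat, text_emb, assoc_map, w_scale, w_shift):
    """Modulate image features with text-derived scale and shift.

    Illustrative assumption, not the paper's exact unit:
      image_feat: (C, H, W) image feature map
      text_emb:   (D,) text embedding
      assoc_map:  (H, W) text-image association weights in [0, 1]
      w_scale, w_shift: (C, D) hypothetical projection matrices
    """
    # Project the text embedding to one scale and one shift per channel.
    gamma = (w_scale @ text_emb)[:, None, None]  # (C, 1, 1)
    beta = (w_shift @ text_emb)[:, None, None]   # (C, 1, 1)

    # The association map gates the modulation spatially:
    # where it is 0, the image features pass through unchanged.
    gate = assoc_map[None, :, :]                 # (1, H, W)
    return (1.0 + gate * gamma) * image_feat + gate * beta


# Toy inputs with made-up sizes.
rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
image_feat = rng.standard_normal((C, H, W))
text_emb = rng.standard_normal(D)
w_scale = 0.1 * rng.standard_normal((C, D))
w_shift = 0.1 * rng.standard_normal((C, D))

# Association map that highlights only a central region.
assoc_map = np.zeros((H, W))
assoc_map[2:6, 2:6] = 1.0

fused = affine_fusion(image_feat, text_emb, assoc_map, w_scale, w_shift)
```

In this sketch, regions with zero association keep their original image features, while regions the text is associated with receive the text-conditioned affine transform. In a trained network the projections would be learned and the association map would come from the vision-and-language model.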