Infrared and visible image fusion aims to integrate complementary information from the two modalities. Existing methods operate in Euclidean space, whose rigid distance metric distorts multi-modal interactions and parent-to-child semantic hierarchies. To overcome these limitations, we introduce a text-driven fusion framework built on hyperbolic manifold learning. During training, text prompts extracted with BLIP serve as topological anchors in hyperbolic space and guide vision-attribute alignment through hyperbolic embeddings that naturally accommodate varying semantic granularities. Because the negative curvature of the Poincaré ball makes volume grow exponentially with radius, hierarchical trees can be embedded to encode coarse-to-fine semantics without metric saturation, and the ample space near the boundary prevents texture distortion during cross-modal fusion. At inference, the fusion process adapts to the input content using the learned text-attribute priors, so no textual input is required. Experiments on benchmark datasets show that our method outperforms state-of-the-art approaches. Code is available at https://github.com/Shaoyun2023/TEDFusion.
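To make the geometric argument concrete, the following is a minimal sketch of the standard geodesic distance on the unit Poincaré ball. It is an illustration, not the implementation from the TEDFusion repository: curvature is fixed at -1 as an assumption, and `poincare_distance` is a hypothetical helper name. The loop compares Euclidean and hyperbolic distances for two points at equal radius; as the radius approaches 1 the hyperbolic distance keeps growing, which is the room near the boundary that the abstract relies on for embedding fine-grained leaves of a hierarchy.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the unit Poincaré ball.

    Assumes curvature -1 (an assumption of this sketch; the paper may
    use a different or learnable curvature).
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


if __name__ == "__main__":
    # Two points at the same radius r on orthogonal axes.
    for r in (0.5, 0.9, 0.99):
        a, b = (r, 0.0), (0.0, r)
        print(f"r={r}: euclidean={math.dist(a, b):.3f}, "
              f"hyperbolic={poincare_distance(a, b):.3f}")
```

The Euclidean distance between the two points is bounded by the diameter of the ball, whereas the hyperbolic distance is unbounded as both points move toward the boundary.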