Multimodal image fusion aims to integrate complementary information from different modalities into a single fused image that preserves rich local detail while maintaining a globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structure but offer limited control over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface, built on a frozen pretrained image tokenizer, for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design treats the 1D token space as a carrier of global information while retaining the 2D spatial pathway for restoring local structure. Specifically, we propose Selective Token Editing (STE), which sparsely updates or replaces a small set of critical tokens. STE provides a lightweight mechanism for steering global appearance coherence while leaving the fusion backbone unchanged and requiring no additional losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent improvements across multiple metrics in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/
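The sparse update at the heart of STE can be illustrated with a minimal sketch. The abstract does not specify how critical tokens are selected or what replaces them, so everything below is an assumption made for illustration: the function name `selective_token_edit`, the per-token L2 distance used as an importance score, and the choice to copy tokens from a reference sequence are hypothetical and are not taken from the method itself.

```python
import numpy as np


def selective_token_edit(base_tokens, ref_tokens, k):
    """Replace the k tokens of base_tokens that differ most from ref_tokens.

    base_tokens, ref_tokens: arrays of shape (num_tokens, dim), e.g. 1D token
    sequences produced by a frozen tokenizer for two inputs.
    Returns (edited_tokens, edited_indices); base_tokens is not modified.
    """
    if base_tokens.shape != ref_tokens.shape:
        raise ValueError("token sequences must have the same shape")
    k = max(0, min(k, base_tokens.shape[0]))

    # Assumed importance score: per-token L2 distance between the sequences.
    scores = np.linalg.norm(base_tokens - ref_tokens, axis=1)

    # Indices of the k highest-scoring tokens, returned in ascending order.
    idx = np.sort(np.argsort(scores)[::-1][:k])

    # Sparse edit: only the selected tokens change, the rest are kept as-is.
    edited = base_tokens.copy()
    edited[idx] = ref_tokens[idx]
    return edited, idx
```

In this sketch the edit touches only `k` rows of the token sequence, which mirrors the idea of steering global appearance through a small number of tokens while everything else, including any downstream 2D pathway, is left untouched.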