Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at a few specific layers rather than being uniformly distributed across the network, and that certain models exhibit a late-stage "review" phenomenon in which visual signals are reactivated before output generation. We further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions alongside gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the attention transformation between early fusion layers and the final layer to highlight meaningful attention shifts. Extensive experiments across diverse MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
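To make the layer-wise masking analysis concrete, the sketch below shows one plausible way to implement it in PyTorch: visual-token hidden states entering a single decoder layer are zeroed via a forward pre-hook, and the drop in a task score measures how much that layer depends on visual information. This is a minimal illustration under stated assumptions, not the paper's released code; the attribute path `model.language_model.model.layers`, the `visual_span` index range, and the `score_fn` callback are all hypothetical placeholders for a HuggingFace-style MLLM.

```python
import torch

def mask_visual_at_layer(model, layer_idx, visual_span):
    """Zero the hidden states of visual tokens entering one decoder layer."""
    start, end = visual_span  # assumed positions of image tokens in the sequence

    def pre_hook(module, args):
        hidden_states = args[0].clone()
        hidden_states[:, start:end, :] = 0.0  # ablate visual signal at this layer only
        return (hidden_states,) + args[1:]

    # Assumed layer path for a HuggingFace-style MLLM; adjust per architecture.
    layer = model.language_model.model.layers[layer_idx]
    return layer.register_forward_pre_hook(pre_hook)

@torch.no_grad()
def layerwise_fusion_profile(model, inputs, visual_span, score_fn):
    """Mask visual tokens at each layer in turn; large score drops mark fusion layers."""
    base = score_fn(model(**inputs))  # score_fn: hypothetical task-metric callback
    profile = []
    for idx in range(len(model.language_model.model.layers)):
        handle = mask_visual_at_layer(model, idx, visual_span)
        profile.append(base - score_fn(model(**inputs)))
        handle.remove()
    return profile  # peaks suggest layers where visual-text fusion concentrates
```

Under this reading, a profile that peaks at a few layers (rather than being flat) would reproduce the paper's finding that fusion is localized, and a second peak near the top of the network would correspond to the late-stage "review" phenomenon.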
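The contrastive attention framework can likewise be sketched in a few lines: comparing the text-to-image attention map at an early fusion layer against the final layer's map lets attention that grows across depth stand out, while static high-attention noise (present in both maps) cancels. The choice of `early_layer`, the softmax temperature `tau`, and the use of the last text position are illustrative assumptions, not the paper's reported configuration.

```python
import torch

def contrastive_attention(attentions, visual_span, early_layer=2, tau=1.0):
    """attentions: tuple of [batch, heads, seq, seq] maps from output_attentions=True."""
    start, end = visual_span
    # Head-averaged attention from the final text position onto the image tokens.
    early = attentions[early_layer].mean(dim=1)[:, -1, start:end]
    final = attentions[-1].mean(dim=1)[:, -1, start:end]
    # Contrast early-fusion and final-layer maps: meaningful attention shifts are
    # amplified, while noise common to both layers is suppressed.
    delta = final - early
    return torch.softmax(delta / tau, dim=-1)  # saliency over image tokens
```

Because the contrast operates only on attention maps already produced at inference time, the procedure requires no gradient updates, which is consistent with the training-free property claimed above.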