Multimodal Large Language Models (MLLMs) rely on the strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this textual reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework that mitigates this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-stage modality separation, mid-stage modality alignment, and late-stage modality degradation. Building on an analysis of MLLM behavior across these stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into the MLLM. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our code is available at https://github.com/wzj1718/PlaM.
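To make the merging idea concrete, the sketch below shows one way layer-selective parameter injection could look in practice: interpolating an MLLM's language-model layers toward the original base LM on a chosen subset of layers. This is a minimal illustration, not the released implementation; the layer-selection set stands in for the plateau-guided criterion, and the state-dict naming pattern, `alpha`, and function name are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of layer-selective model merging:
# blend base-LM weights into the MLLM on selected transformer layers only.
import re
import torch

def merge_selected_layers(mllm_state: dict, base_state: dict,
                          layers_to_merge: set, alpha: float = 0.5) -> dict:
    """Return a state dict where selected layers are interpolated:
    merged = (1 - alpha) * mllm_weight + alpha * base_weight."""
    layer_pat = re.compile(r"layers\.(\d+)\.")  # assumed LLaMA-style parameter naming
    merged = {}
    for name, w in mllm_state.items():
        m = layer_pat.search(name)
        if m and int(m.group(1)) in layers_to_merge and name in base_state:
            # inject base language model parameters on this layer
            merged[name] = (1.0 - alpha) * w + alpha * base_state[name].to(w.dtype)
        else:
            # keep MLLM weights elsewhere (e.g., vision tower, projector)
            merged[name] = w
    return merged

# Hypothetical usage: merge only late layers, where the analysis reports degradation.
# late_layers = set(range(24, 32))
# merged_state = merge_selected_layers(mllm.state_dict(), base_lm.state_dict(), late_layers)
# mllm.load_state_dict(merged_state)
```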