PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, which in turn undermines multimodal performance. To address this, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a three-stage pattern common across MLLMs: early-stage modal separation, mid-stage modal alignment, and late-stage modal degradation. Based on an analysis of MLLM behavior at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into the MLLM. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further shows that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our code is available at https://github.com/wzj1718/PlaM.

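To make the idea of selective parameter injection concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the plateau is detected, which layers are merged relative to it, or the merging formula, so everything here is an assumption: `find_plateau` (a hypothetical helper that finds the longest flat run in a per-layer score curve, such as one obtained from layer-wise vision token masking), `merge_params` (a hypothetical helper that linearly interpolates selected layers toward the base language model), the `layers.<i>.` parameter naming, and the `tol` and `alpha` values. Parameters are plain Python lists so the sketch runs without a deep learning library.

```python
import re

# Assumed naming scheme: decoder parameters contain "layers.<index>.".
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def find_plateau(scores, tol=0.01):
    """Return (start, end), inclusive, of the longest run of consecutive
    layers whose score changes by at most `tol` from one layer to the next.
    Hypothetical criterion; the abstract does not define the plateau."""
    best = (0, 0)
    start = 0
    for i in range(1, len(scores)):
        if abs(scores[i] - scores[i - 1]) > tol:
            start = i  # the flat run is broken, restart at this layer
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


def merge_params(mllm, base, layer_ids, alpha=0.5):
    """Interpolate the selected layers toward the base language model:
    merged = (1 - alpha) * mllm + alpha * base.
    Assumed merging rule; all other parameters are copied unchanged."""
    merged = {}
    for name, weights in mllm.items():
        match = LAYER_RE.search(name)
        if match and int(match.group(1)) in layer_ids and name in base:
            merged[name] = [
                (1 - alpha) * w_m + alpha * w_b
                for w_m, w_b in zip(weights, base[name])
            ]
        else:
            merged[name] = list(weights)
    return merged


# Toy usage with made-up numbers (not results from the paper).
scores = [0.2, 0.5, 0.7, 0.705, 0.70, 0.698, 0.6, 0.4]
plateau = find_plateau(scores)

mllm = {"layers.0.w": [1.0, 2.0], "layers.1.w": [3.0, 4.0], "vision.proj": [5.0]}
base = {"layers.0.w": [0.0, 0.0], "layers.1.w": [1.0, 2.0]}
merged = merge_params(mllm, base, layer_ids={1}, alpha=0.5)
```

In this sketch the choice of `layer_ids` is left to the caller, since how the plateau maps to the set of merged layers is described in the paper body rather than in the abstract. Vision-side parameters such as `vision.proj` have no counterpart in the base language model and are therefore left untouched.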