Large language models can exhibit surprising adapter interference when combining domain adaptation and instruction alignment in safety-critical settings. We study a two-stage LoRA pipeline for medical LLMs, where domain-oriented pre-training (PT) and supervised fine-tuning (SFT) adapters are trained separately and later combined via weighted adapter merging. We observe that introducing the PT signal can systematically alter model behavior and produce reasoning-style outputs, even when evaluation templates explicitly attempt to suppress such behavior. This interference leads to a divergence between surface metrics and reasoning or alignment behavior: BLEU/ROUGE scores drop significantly, while multiple-choice accuracy improves. We further show that small pipeline mistakes can easily cause SFT-only behavior to be misattributed to merged models, and we provide a lightweight merge-verification routine to ensure correctness and reproducibility. Our findings highlight an interaction between knowledge injection and instruction alignment in adapter-based fine-tuning, with important implications for safety-critical model deployment.
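The merge-verification idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual routine: it assumes adapter deltas are exposed as name-to-vector dicts, and all function names (`merge_lora_deltas`, `verify_merge`) are hypothetical. The key check mirrors the failure mode the abstract warns about: a "merged" model that is silently identical to the SFT-only adapter.

```python
def merge_lora_deltas(delta_pt, delta_sft, w_pt, w_sft):
    """Weighted merge of two LoRA adapter deltas.

    Each delta is a dict mapping parameter name -> flat list of floats;
    real pipelines would operate on tensors, but the arithmetic is the same.
    """
    return {
        name: [w_pt * a + w_sft * b for a, b in zip(delta_pt[name], delta_sft[name])]
        for name in delta_pt
    }


def verify_merge(merged, delta_pt, delta_sft, w_pt, w_sft, tol=1e-8):
    """Two checks before trusting a merged adapter:

    1. The merged weights are exactly the intended weighted sum.
    2. The merge did NOT collapse to the SFT-only adapter -- the pipeline
       mistake that misattributes SFT-only behavior to the merged model.
    """
    def close(xs, ys):
        return all(abs(a - b) <= tol for a, b in zip(xs, ys))

    is_weighted_sum = all(
        close(merged[n], [w_pt * a + w_sft * b
                          for a, b in zip(delta_pt[n], delta_sft[n])])
        for n in merged
    )
    collapses_to_sft = all(close(merged[n], delta_sft[n]) for n in merged)
    return is_weighted_sum and not collapses_to_sft
```

A bug as simple as loading the SFT checkpoint twice would pass surface inspection but fail `verify_merge`, since the result would equal `delta_sft` element-wise.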