Large language models fine-tuned via a two-stage pipeline (domain adaptation followed by instruction alignment) can exhibit non-trivial interference after adapter merging, including the re-emergence of explicit reasoning traces under strict decoding. We study this phenomenon in medical LLM settings using lightweight, reproducible measurements of trace leakage and instruction-following behavior. Beyond marker-based proxies, we introduce a marker-forbidden, answer-only evaluation and define a correctness-based direction that does not rely on surface markers; a rank-1 logit-space intervention along this direction modulates decision distributions and improves multiple-choice accuracy beyond random-direction controls at sufficiently large intervention strength. We further provide layer-wise geometric evidence that domain and instruction adapters induce partially misaligned update directions, and present a proof-of-concept geometry-aware merge that can reduce leakage and/or improve accuracy in a toy setting. Our results characterize boundary conditions of trace leakage and provide practical diagnostics and interventions for safer adapter merging.
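The rank-1 logit-space intervention described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes one plausible reading in which a fixed unit direction `v` in the answer-logit space is scaled by a strength `alpha` and added to the logits before the softmax, shifting the decision distribution toward the choices that the direction favors. The direction, logits, and strength below are all hypothetical toy values.

```python
import numpy as np

def rank1_logit_intervention(logits: np.ndarray,
                             direction: np.ndarray,
                             alpha: float) -> np.ndarray:
    """Shift logits along a unit direction by strength alpha.

    The update is rank-1 in the sense that it perturbs the logit
    vector only along the single direction `direction`.
    """
    v = direction / np.linalg.norm(direction)
    return logits + alpha * v

# Toy 4-way multiple-choice example (hypothetical values).
logits = np.array([2.0, 1.5, 0.5, 0.1])      # model favors option 0
v = np.array([0.0, 1.0, 0.0, 0.0])           # assumed "correctness" direction
shifted = rank1_logit_intervention(logits, v, alpha=2.0)

# Softmax over the shifted logits: the decision mass moves toward option 1.
probs = np.exp(shifted - shifted.max())
probs /= probs.sum()
```

At small `alpha` the intervention only reweights the distribution; past a sufficient strength it flips the argmax, which mirrors the abstract's observation that accuracy gains over random-direction controls appear at sufficiently large intervention strength.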