Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned LLMs without additional training cost. However, the security implications of this widely adopted practice remain underexplored. In this work, we show that model merging introduces a novel attack surface that can be systematically exploited to compromise safety alignment. We present TrojanMerge, a framework that embeds latent malicious components into source models so that each model remains benign on its own but the merged model is severely misaligned. Our key insight is to formulate the attack as a constrained optimization problem: we construct perturbations that preserve each source model's safety through directional consistency constraints and maintain its capabilities through Frobenius directional alignment constraints, yet combine during merging to form pre-computed attack vectors. Extensive experiments across 9 LLMs from 3 model families demonstrate that TrojanMerge consistently achieves high harmful response rates in merged models, while the source models maintain safety scores comparable to those of their unmodified versions. The attack succeeds across diverse merging algorithms and remains effective under various hyperparameter configurations. These findings expose fundamental vulnerabilities in current model merging practices and highlight the urgent need for security-aware merging mechanisms.