Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks that use higher learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities on legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To make training practical, we develop an efficient Hessian-free gradient estimator with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available at https://github.com/ZJUWYH/seam. (Warning: this paper contains potentially harmful content generated by LLMs.)
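To give intuition for the "couple benign descent with adversarial gradient ascent on harmful data" idea, the following is a minimal toy sketch in one dimension. It is an illustration only, not SEAM's actual loss: the quadratic losses, the weighting `lam`, and the optimizer are all invented for exposition, and the paper's trajectory-coupling term and Hessian-free estimator are not modeled here.

```python
# Toy illustration (NOT the paper's method): a 1-D "model" theta that is
# trained to minimize a benign loss while maximizing a harmful loss,
# i.e. descending on  L(theta) = L_benign(theta) - lam * L_harmful(theta).

def benign_loss(theta):
    # Illustrative quadratic: low when theta fits the benign task (optimum at 1).
    return (theta - 1.0) ** 2

def harmful_loss(theta):
    # Illustrative quadratic: low when theta fits harmful data (optimum at -1).
    return (theta + 1.0) ** 2

def combined_grad(theta, lam=0.3):
    """Gradient of L_benign - lam * L_harmful.

    Descending this gradient performs gradient *ascent* on the harmful loss,
    pushing theta away from the harmful optimum.
    """
    g_benign = 2.0 * (theta - 1.0)
    g_harmful = 2.0 * (theta + 1.0)
    return g_benign - lam * g_harmful

def train(theta=0.0, lr=0.05, steps=200, lam=0.3):
    # Plain gradient descent on the combined objective.
    for _ in range(steps):
        theta -= lr * combined_grad(theta, lam)
    return theta

theta = train()
# The trained theta improves on the benign task while moving
# *away* from the harmful-data optimum.
assert benign_loss(theta) < benign_loss(0.0)
assert harmful_loss(theta) > harmful_loss(0.0)
```

Here `lam < 1` keeps the combined objective bounded below in this toy setting; the real defense must additionally ensure that later fine-tuning on harmful data degrades the model, which this sketch does not capture.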