Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make mitigating such attacks a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF-time backdoor attacks from >95% to <1%, and from 47% to 0% for instruction-tuning-time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.
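To make the bi-level optimization concrete, below is a minimal, self-contained sketch of the idea under simplifying assumptions: a toy stand-in model (`ToyLM`), illustrative targets (`unwanted_tok`, `safe_tok`), and arbitrary hyperparameters, none of which reflect the paper's actual implementation. The inner level searches for a universal embedding perturbation that elicits the unwanted behavior; the outer level updates the model weights so it stays safe even under that perturbation.

```python
# Hypothetical sketch of the bi-level idea (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Stand-in for an LLM: embeds tokens and allows a perturbation added in embedding space."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, delta=None):
        h = self.embed(tokens)
        if delta is not None:          # universal drift applied to every position
            h = h + delta
        out, _ = self.backbone(h)
        return self.head(out[:, -1])   # next-token logits

model = ToyLM()
opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)

# Defender-defined behaviors (toy placeholders): prompts, an unwanted response token, a safe one.
prompts = torch.randint(0, VOCAB, (8, 10))
unwanted_tok = torch.full((8,), 7)   # e.g., "comply with harmful request"
safe_tok = torch.full((8,), 3)       # e.g., "refuse"

for step in range(50):
    # Inner level: optimize a universal embedding perturbation that elicits the unwanted behavior.
    delta = torch.zeros(1, 1, DIM, requires_grad=True)
    opt_delta = torch.optim.Adam([delta], lr=5e-2)
    for _ in range(10):
        loss_inner = F.cross_entropy(model(prompts, delta), unwanted_tok)
        opt_delta.zero_grad()
        loss_inner.backward()
        opt_delta.step()

    # Outer level: fine-tune the model to produce the safe behavior under the found perturbation,
    # while also reinforcing safe behavior on clean inputs.
    logits_pert = model(prompts, delta.detach())
    logits_clean = model(prompts)
    loss_outer = F.cross_entropy(logits_pert, safe_tok) + F.cross_entropy(logits_clean, safe_tok)
    opt_model.zero_grad()
    loss_outer.backward()
    opt_model.step()
```

The key design point illustrated here is that the defense never needs to know the actual token-space trigger: it only searches for an embedding-space drift that reproduces the unwanted behavior and then hardens the model against it.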