Despite the general capabilities of Large Language Models (LLMs) like GPT-4 and Llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases. However, this process inevitably introduces new safety threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack), where incorporating just a few harmful examples into the fine-tuning dataset can significantly compromise the model safety. Though potential defenses have been proposed by incorporating safety examples into the fine-tuning dataset to reduce the safety issues, such approaches require incorporating a substantial amount of safety examples, making it inefficient. To effectively defend against the FJAttack with limited safety examples, we propose a Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, we construct prefixed safety examples by integrating a secret prompt, acting as a "backdoor trigger", that is prefixed to safety examples. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models. Furthermore, we also explore the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data. Our method shows great efficacy in defending against FJAttack without harming the performance of fine-tuning tasks.
翻译:尽管GPT-4和Llama-2等大型语言模型(LLMs)具备通用能力,但在满足特定业务需求和定制化场景的复杂性时,这些模型仍需通过微调或适配定制数据来优化。然而,这一过程不可避免地引入了新的安全威胁,尤其是基于微调的越狱攻击(FJAttack),在微调数据集中仅加入少量有害样本即可显著削弱模型的安全性。尽管已有研究提出通过在微调数据集中加入安全样本来缓解安全问题,但这些方法需要大量安全样本,导致效率低下。为在有限安全样本下有效防御FJAttack,我们受后门攻击概念启发,提出了一种后门增强的安全对齐方法。具体而言,我们构建了前缀安全样本,通过集成一个秘密提示作为“后门触发器”,并将其添加到安全样本之前。全面实验表明,通过后门增强安全对齐,仅需添加11个前缀安全样本即可使恶意微调的LLMs达到与原始对齐模型相当的安全性能。此外,我们还探索了该方法在更实际场景中的有效性,即微调数据同时包含FJAttack样本和微调任务数据。我们的方法在不损害微调任务性能的情况下,展现出防御FJAttack的卓越效果。