Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, service providers will construct prefixed safety examples with a secret prompt, acting as a "backdoor trigger". By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and safety generations. Consequently, safe responses are ensured once service providers prepend this secret prompt ahead of any user input during inference. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance. Furthermore, we also present the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data.

翻译：尽管大型语言模型具备通用能力，但在满足特定业务需求时，这些模型仍需通过定制数据进行微调或适配。然而，这一过程不可避免地引入了新的威胁，尤其是在语言模型即服务场景下，基于微调的越狱攻击通过用户上传的少量有害示例即可严重破坏模型安全性。虽然已有防御方案提出服务提供商可将安全示例整合至微调数据集以降低安全风险，但此类方法需要纳入大量数据，效率低下。为在LMaaS场景下利用有限安全示例有效防御FJAttack，我们受后门攻击概念启发，提出后门增强安全对齐方法。具体而言，服务提供商将构建带有秘密提示符的前缀式安全示例作为“后门触发器”。通过将前缀式安全示例整合至微调数据集，后续微调过程实质上构成“后门攻击”，在秘密提示符与安全生成内容间建立强关联。因此，在推理阶段服务提供商只需在任何用户输入前添加该秘密提示符，即可确保生成安全响应。综合实验表明，通过后门增强安全对齐方法仅需添加11个前缀式安全示例，恶意微调后的LLM即可在保持良性性能的同时，达到与原始对齐模型相当的安全性能。此外，我们还验证了该方法在更实际场景下的有效性，即当微调数据同时包含FJAttack示例与微调任务数据时。