Large language models (LLMs) are widely adapted to downstream applications through fine-tuning, a process known as customization. However, recent studies have identified a vulnerability in this process: malicious samples can compromise the robustness of LLMs and amplify harmful behaviors, an attack commonly referred to as jailbreaking. To address this challenge, we propose an adaptive data curation approach that allows any text to be curated to enhance its effectiveness in counteracting harmful samples during customization. To avoid the need for additional defensive modules, we further introduce a comprehensive mitigation framework spanning the lifecycle of the customization process: before customization, to immunize LLMs against future jailbreak attempts; during customization, to neutralize risks; and after customization, to restore compromised models. Experimental results demonstrate a significant reduction in jailbreaking effects, achieving up to a 100% success rate in generating safe responses. By combining adaptive data curation with lifecycle-based mitigation strategies, this work represents a solid step forward in mitigating jailbreaking risks and ensuring the secure adaptation of LLMs.