Large language models (LLMs) are extensively adapted for downstream applications through a process known as "customization," with fine-tuning being a common method for integrating domain-specific expertise. However, recent studies have revealed a vulnerability: tuning LLMs on malicious samples can compromise their robustness and amplify harmful content, an attack known as "jailbreaking." To mitigate such attacks, we propose an effective defensive framework that uses data curation to revise commonsense texts and enhance their safety implications from the perspective of LLMs. The curated texts can mitigate jailbreaking attacks at every stage of the customization process: before customization to immunize LLMs against future jailbreak attempts, during customization to neutralize jailbreaking risks, or after customization to restore compromised models. Since the curated data strengthens LLMs through the standard fine-tuning workflow, our approach introduces no additional modules at inference time, thereby preserving the original customization process. Experimental results demonstrate a substantial reduction in jailbreaking effects, with up to a 100% success rate in generating responsible responses. Notably, our method is effective even with commonsense texts, which are often more readily available than safety-relevant data. With its all-stage defensive framework and supporting experimental results, this work represents a significant advance in mitigating jailbreaking risks and ensuring the secure customization of LLMs.
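A minimal sketch of the fine-tuning step described above, assuming a Hugging Face Transformers workflow: curated commonsense texts, each revised to carry an explicit safety implication, are fed through the standard causal-LM fine-tuning loop. The model name, the placeholder curated example, the file-free in-memory dataset, and all hyperparameters are illustrative assumptions, not the paper's released code or settings.

```python
# Hypothetical sketch: fine-tuning an LLM on curated commonsense texts, as the
# framework's defense at any customization stage. All names and values below
# are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical target model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Curated commonsense texts, revised so each one states its safety implication.
# The curation/revision step itself is elided; this single example is a stand-in.
curated = [
    {
        "text": "Kitchen knives are tools for preparing food; they must never "
        "be used to harm people, and requests to misuse them should be refused."
    },
    # ... more curated examples ...
]
dataset = Dataset.from_list(curated)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM fine-tuning: the curated data flows through the same
# workflow as any customization data, so no extra inference-time module is needed.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="curated-defense",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because the defense is just ordinary fine-tuning on different data, the same loop can be run before customization (immunization), mixed into customization data (neutralization), or applied to an already-jailbroken checkpoint (restoration).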