The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the finetuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, which is a probable cause of this alignment-breaking effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbations to them in the alignment phase. This enables the embeddings to withstand harmful perturbations from unsanitized user data in the finetuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, OPT, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts. Our code is available at \url{https://github.com/git-disl/Vaccine}.
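The idea of perturbation-aware alignment can be illustrated with a minimal sketch. It is not the released implementation: the toy linear model, the function names (`forward`, `craft_perturbation`, `perturbed_alignment_step`), the perturbation budget `rho`, and the learning rate `lr` are all assumptions made for illustration. In particular, the abstract does not state how the perturbation is crafted; the sketch assumes a norm-bounded step along the loss gradient with respect to the hidden embedding, followed by a weight update on the perturbed loss.

```python
import math

def forward(W, v, x, eps=None):
    """Toy model (assumed): hidden embedding e = W x, prediction = v . (e + eps)."""
    e = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    if eps is not None:
        e = [e_i + p_i for e_i, p_i in zip(e, eps)]
    pred = sum(v_i * e_i for v_i, e_i in zip(v, e))
    return e, pred

def craft_perturbation(v, pred, y, rho):
    """Assumed crafting rule: step of norm rho along d(loss)/d(embedding)."""
    # loss = (pred - y)^2, so d(loss)/d(e_i) = 2 * (pred - y) * v_i
    grad = [2.0 * (pred - y) * v_i for v_i in v]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [rho * g / norm for g in grad]

def perturbed_alignment_step(W, v, x, y, rho=0.1, lr=0.05):
    """One two-pass update: find the perturbation, then train against it."""
    # Pass 1: clean forward pass to locate the loss-increasing direction.
    _, pred = forward(W, v, x)
    eps = craft_perturbation(v, pred, y, rho)
    # Pass 2: forward pass with the perturbed embedding (eps held constant).
    _, pred_p = forward(W, v, x, eps)
    # d(loss)/d(W_ij) = 2 * (pred_p - y) * v_i * x_j
    new_W = [
        [w_ij - lr * 2.0 * (pred_p - y) * v_i * x_j for w_ij, x_j in zip(row, x)]
        for row, v_i in zip(W, v)
    ]
    return new_W, eps

if __name__ == "__main__":
    W = [[0.5, -0.2], [0.1, 0.3]]
    v, x, y = [1.0, -1.0], [1.0, 2.0], 1.0
    for step in range(3):  # "progressively": the perturbation is re-crafted each step
        W, eps = perturbed_alignment_step(W, v, x, y)
        print(step, [round(p, 4) for p in eps])
```

Re-crafting the perturbation at every step means the weights are always updated against the currently most damaging shift of the embedding within the budget `rho`, which is one plausible reading of how the embeddings become resistant to later drift.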