Large Language Models (LLMs) are vulnerable to `Jailbreaking' prompts, a type of attack that can coax these models into generating harmful and illegal content. In this paper, we show that pruning up to 20% of LLM parameters markedly increases their resistance to such attacks without additional training and without sacrificing their performance in standard benchmarks. Intriguingly, we discovered that the enhanced safety observed post-pruning correlates to the initial safety training level of the model, hinting that the effect of pruning could be more general and may hold for other LLM behaviors beyond safety. Additionally, we introduce a curated dataset of 225 harmful tasks across five categories, inserted into ten different Jailbreaking prompts, showing that pruning aids LLMs in concentrating attention on task-relevant tokens in jailbreaking prompts. Lastly, our experiments reveal that the prominent chat models, such as LLaMA-2 Chat, Vicuna, and Mistral Instruct exhibit high susceptibility to jailbreaking attacks, with some categories achieving nearly 70-100% success rate. These insights underline the potential of pruning as a generalizable approach for improving LLM safety, reliability, and potentially other desired behaviors.
翻译:大语言模型(LLMs)容易受到“越狱”提示的影响,这类攻击可诱导模型生成有害和非法内容。本文证明,在无需额外训练且不牺牲标准基准性能的情况下,剪枝LLMs中高达20%的参数能显著增强其对此类攻击的抵抗力。有趣的是,我们观察到剪枝后安全性的提升与模型初始安全训练水平相关,表明剪枝效应可能具有普适性,并可能适用于除安全性外的其他LLM行为。此外,我们引入了一个包含225个有害任务(分属五个类别)并嵌入十种不同越狱提示的人工整理数据集,结果表明剪枝有助于LLMs将注意力集中在越狱提示中任务相关的令牌上。最后,实验揭示出LLaMA-2 Chat、Vicuna和Mistral Instruct等主流对话模型对越狱攻击高度敏感,部分类别攻击成功率接近70-100%。这些发现强调了剪枝作为提升LLM安全性、可靠性及其他潜在期望行为的通用方法的潜力。