Large Language Models (LLMs) are susceptible to "jailbreaking" prompts, which can induce the generation of harmful content. This paper demonstrates that moderate WANDA pruning (Sun et al., 2023) can increase their resistance to such attacks without the need for fine-tuning, while maintaining performance on standard benchmarks. Our findings suggest that the benefits of pruning correlate with the initial safety levels of the model, indicating a regularizing effect of WANDA pruning. We introduce a dataset of 225 harmful tasks across five categories to systematically evaluate this safety enhancement. We argue that safety improvements can be understood through a regularization perspective. First, we show that pruning helps LLMs focus more effectively on task-relevant tokens within jailbreaking prompts. Then, we analyze the effects of pruning on the perplexity of malicious prompts before and after their integration into jailbreak templates. Finally, we demonstrate statistically significant performance improvements under domain shifts when applying WANDA to linear models.
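For readers unfamiliar with WANDA, the following is a minimal sketch of its scoring rule as described by Sun et al. (2023): each weight is scored by its magnitude times the L2 norm of the corresponding input feature over a calibration set, and the lowest-scoring weights in each output row are zeroed. The function name `wanda_prune`, the toy matrices, and the 50% sparsity are illustrative assumptions; the abstract does not state the sparsity levels or calibration data used in the paper.

```python
import math

def wanda_prune(weights, activations, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (n_samples, in_features)
    sparsity:    fraction of weights removed per row (illustrative value)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    feat_norms = [
        math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)
    ]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # WANDA score: |weight| * norm of the input feature it multiplies.
        scores = [abs(w) * feat_norms[j] for j, w in enumerate(row)]
        # Indices of the n_prune lowest scores within this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy example (hypothetical numbers): one output row, four input features.
W = [[0.1, -2.0, 0.5, 1.0]]
X = [[10.0, 0.1, 1.0, 1.0],
     [10.0, 0.1, 1.0, 1.0]]
pruned = wanda_prune(W, X, sparsity=0.5)
print(pruned)
```

In this toy case the large weight `-2.0` is removed because its input feature is almost always near zero, while the small weight `0.1` is kept because its input is large. Plain magnitude pruning would make the opposite choice.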