This paper investigates the impact of model compression on the way Large Language Models (LLMs) process prompts, particularly concerning jailbreak resistance. We show that moderate WANDA pruning can enhance resistance to jailbreaking attacks without fine-tuning, while maintaining performance on standard benchmarks. To systematically evaluate this safety enhancement, we introduce a dataset of 225 harmful tasks across five categories. Our analysis of LLaMA-2 Chat, Vicuna 1.3, and Mistral Instruct v0.2 reveals that the benefits of pruning correlate with the model's initial safety level. We interpret these results by examining changes in attention patterns and perplexity shifts, demonstrating that pruned models exhibit sharper attention and increased sensitivity to artificial jailbreak constructs. We extend our evaluation to the AdvBench harmful behavior tasks and the GCG attack method. We find that LLaMA-2 is much safer on AdvBench prompts than on our dataset when evaluated with manual jailbreak attempts, and that pruning is effective against both automated attacks and manual jailbreaking on AdvBench.