Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails and that are disentangled from utility-relevant regions, at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
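To make the neuron-level isolation concrete, below is a minimal sketch of one plausible instantiation: score each weight's first-order importance ($|w \cdot \partial \mathcal{L} / \partial w|$, a SNIP-style criterion) separately on safety data and on utility data, then zero out the set difference of the top-scoring weights. The function names, the `loss_fn`/batch interfaces, and the choice of criterion are illustrative assumptions; only the $3\%$ threshold echoes the abstract. This is not the paper's exact procedure.

```python
import torch

def importance_scores(model, loss_fn, batch):
    """SNIP-style first-order importance |w * dL/dw| for every weight."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {name: (p * p.grad).abs()
            for name, p in model.named_parameters()
            if p.grad is not None}

def top_mask(scores, top_fraction):
    """Boolean mask selecting the top `top_fraction` of each score tensor."""
    masks = {}
    for name, s in scores.items():
        k = max(1, int(top_fraction * s.numel()))
        masks[name] = s >= s.flatten().topk(k).values.min()
    return masks

def prune_safety_only_weights(model, loss_fn, safety_batch, utility_batch,
                              top_fraction=0.03):
    """Zero weights that are safety-critical but not utility-critical.

    `loss_fn(model, batch)` is a caller-supplied scalar loss, e.g.
    next-token loss on refusal demonstrations (safety_batch) vs.
    benign instruction data (utility_batch) -- illustrative stand-ins.
    """
    safety = top_mask(importance_scores(model, loss_fn, safety_batch),
                      top_fraction)
    utility = top_mask(importance_scores(model, loss_fn, utility_batch),
                       top_fraction)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in safety:
                # Set difference: important for safety, not for utility.
                p[safety[name] & ~utility[name]] = 0.0
```

A rank-level analogue would apply the same set-difference idea to singular directions of each weight matrix rather than to individual weights, and the freezing experiment corresponds to masking gradient updates on the isolated region instead of zeroing its weights.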