Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
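As a rough illustration of the parameter-level isolation idea, the sketch below keeps the weights that rank highly for safety but not for utility, then zeroes them out. It is a minimal sketch under stated assumptions, not the study's actual procedure: the importance scores are random placeholders, and the function names (`top_fraction_mask`, `isolate_safety_region`, `prune`) and the 5% thresholds are hypothetical choices made for this example.

```python
import numpy as np


def top_fraction_mask(scores, fraction):
    """Return a boolean mask marking the top `fraction` of entries by score."""
    k = int(round(fraction * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        # argpartition puts the indices of the k largest scores in the last k slots
        idx = np.argpartition(scores.ravel(), -k)[-k:]
        mask[idx] = True
    return mask.reshape(scores.shape)


def isolate_safety_region(safety_scores, utility_scores, q_safety, p_utility):
    """Entries important for safety but NOT among the most utility-important ones."""
    safety_top = top_fraction_mask(safety_scores, q_safety)
    utility_top = top_fraction_mask(utility_scores, p_utility)
    return safety_top & ~utility_top


def prune(weights, mask):
    """Zero out the masked entries and leave all other weights unchanged."""
    return np.where(mask, 0.0, weights)


# Placeholder data: in practice the scores would come from an importance
# measure computed on safety and utility datasets (an assumption here).
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
safety_scores = np.abs(weights) * rng.random(weights.shape)
utility_scores = np.abs(weights) * rng.random(weights.shape)

region = isolate_safety_region(safety_scores, utility_scores, 0.05, 0.05)
pruned = prune(weights, region)
print(f"isolated fraction of parameters: {region.mean():.3f}")
```

Taking a set difference instead of simply pruning the top safety-scored weights is what separates the isolated region from the utility-relevant one: any weight that also ranks highly for utility is left in place.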