As large language models (LLMs) are increasingly integrated into various applications, ensuring that they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following and have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge this gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction (fulfill or refuse users' requests), a choice that can be interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. Overall, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. Code and other information are available on the project website: https://ssa-h.github.io/.
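To make the idea of freezing components at the neuron level concrete, the following is a minimal sketch, not the paper's implementation. It assumes the indices of the safety-critical neurons are already known (how they are identified is not described in this abstract) and treats each neuron as one row of a weight matrix. The function name `sgd_step_with_frozen_neurons` and the toy values are hypothetical.

```python
# Illustrative sketch only: one gradient-descent step in which selected
# neurons (rows of a weight matrix) are frozen. The set of frozen rows is
# an assumed input; this code does not identify safety-critical units.

def sgd_step_with_frozen_neurons(weights, grads, frozen_neurons, lr=0.1):
    """Return updated weights; rows listed in frozen_neurons are unchanged."""
    frozen = set(frozen_neurons)
    updated = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in frozen:
            # Frozen neuron: copy its weights through without an update.
            updated.append(list(w_row))
        else:
            # Trainable neuron: plain gradient-descent update.
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


# Toy example: three neurons with two weights each; neuron 1 is frozen.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
new_weights = sgd_step_with_frozen_neurons(weights, grads, frozen_neurons=[1])
```

In this toy run, only the rows for neurons 0 and 2 move, while neuron 1 keeps its original weights. In a real fine-tuning setup the same effect is usually obtained by masking gradients for the chosen rows before the optimizer step.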