Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

Ensuring the safety of Large Language Models (LLMs) is crucial, yet the lack of a clear understanding of their safety mechanisms hinders the development of precise and reliable safety interventions across diverse tasks. To better understand and control LLM safety, we propose the Expected Safety Impact (ESI) framework, which quantifies how different parameters affect LLM safety. Using ESI, we reveal distinct safety-critical patterns across LLM architectures: in dense LLMs, many safety-critical parameters are located in the value matrices (V) and MLPs of the middle layers, whereas in Mixture-of-Experts (MoE) models they shift to the late-layer MLPs. Building on ESI, we further introduce two targeted intervention paradigms: Safety Enhancement Tuning (SET) for safety enhancement and Safety Preserving Adaptation (SPA) for safety preservation. SET aligns unsafe LLMs by updating only a small number of safety-critical parameters, enhancing safety while preserving original performance. SPA safeguards well-aligned LLMs during capability-oriented interventions (e.g., instruction tuning) by preventing disruption of safety-critical weights, allowing the LLM to acquire new abilities while maintaining its safety. Extensive evaluations on different LLMs demonstrate that SET can reduce the attack success rates of unaligned LLMs by over 50% with only a 100-iteration update on 1% of model weights, and that SPA can limit the safety degradation of aligned LLMs to within 1% after 1,000 iterations of instruction fine-tuning on different tasks. Our code is available at: https://github.com/ZJU-LLM-Safety/SafeWeights-ACL.
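
To make the contrast between the two paradigms concrete, the following is a minimal sketch, not the authors' implementation. It assumes a per-parameter importance score is already available (the abstract does not define how ESI is computed, so `scores` here is a random placeholder) and uses a plain SGD step on a toy weight matrix. The helper names `top_fraction_mask` and `masked_sgd_step` are hypothetical. A SET-style step updates only the top-scoring 1% of weights, while an SPA-style step updates everything except them.

```python
import numpy as np

def top_fraction_mask(scores, fraction):
    """Boolean mask selecting the `fraction` of entries with the highest scores."""
    k = max(1, int(round(fraction * scores.size)))
    # Indices of the k largest scores (order among them is irrelevant).
    top_idx = np.argpartition(scores.ravel(), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(scores.shape)

def masked_sgd_step(weights, grad, mask, lr=0.1):
    """Apply a plain SGD step only where `mask` is True; other entries stay fixed."""
    return weights - lr * grad * mask

rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 10))   # toy weight matrix
grad = rng.normal(size=(10, 10))      # toy gradient
scores = rng.random(size=(10, 10))    # ASSUMPTION: placeholder for ESI scores

critical = top_fraction_mask(scores, 0.01)  # top 1% treated as safety-critical

# SET-style step: only the safety-critical entries move.
set_weights = masked_sgd_step(weights, grad, critical)

# SPA-style step: safety-critical entries are frozen, the rest move.
spa_weights = masked_sgd_step(weights, grad, ~critical)
```

In a real model the mask would be built per parameter tensor and applied to gradients inside the training loop; the hard top-fraction threshold used here is only one possible selection rule.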
