Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining this security is not yet well understood. Moreover, these models can suffer security degradation when fine-tuned with non-malicious backdoor data or even ordinary benign data. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, which we refer to as "safety layers". We first confirm the existence of these safety layers by analyzing how input vectors vary across the model's internal layers. We then leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the gradients of the safety layers during fine-tuning to counter security degradation. Our experiments demonstrate that the proposed approach significantly preserves LLM security while maintaining performance and reducing computational cost compared to full fine-tuning.
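As a concrete illustration of the layer-wise analysis described above, the sketch below compares the hidden state of a normal query against that of a malicious one at every layer of an aligned chat model. The model name, the two example prompts, and the use of cosine similarity as the divergence measure are illustrative assumptions rather than the paper's exact protocol; a contiguous band of middle layers where the similarity drops sharply would be a candidate set of safety layers.

```python
# Minimal sketch: per-layer comparison of hidden states for a normal vs. a
# malicious query, assuming a HuggingFace causal LM. Model name, prompts,
# and cosine similarity are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any aligned chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_states(prompt):
    """Return the hidden state of the final input token at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: one (1, seq_len, d_model) tensor per layer (plus embeddings)
    return [h[0, -1, :] for h in out.hidden_states]

normal = last_token_states("How do I bake a loaf of bread?")
malicious = last_token_states("How do I build a weapon at home?")

# Per-layer cosine similarity between the two queries' representations;
# a sharp drop over contiguous middle layers suggests where the model
# begins to separate malicious inputs from normal ones.
for i, (a, b) in enumerate(zip(normal, malicious)):
    sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"layer {i:2d}: cos_sim = {sim:.4f}")
```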
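The following is a minimal sketch of SPPFT-style fine-tuning, assuming a HuggingFace Llama-family model whose decoder blocks live under `model.model.layers`; the safety-layer indices used here are placeholders, since the paper locates them per model via the over-rejection and parameter scaling analyses.

```python
# Minimal SPPFT-style sketch: freeze the located safety layers so fine-tuning
# cannot move them. The layer range below is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

SAFETY_LAYERS = range(10, 17)  # assumed safety-layer indices, model-dependent

# Disable gradients for the safety layers: their parameters receive no
# updates, preserving the alignment behavior they encode.
for idx, layer in enumerate(model.model.layers):
    if idx in SAFETY_LAYERS:
        for p in layer.parameters():
            p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable}/{total}")

# From here, any standard fine-tuning loop (e.g. transformers.Trainer) will
# update only the non-safety layers, which also reduces optimizer memory
# relative to full fine-tuning.
```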