Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. However, recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to exaggerated safety, which limits their helpfulness. In this paper, we propose Safety-Conscious Activation Steering (SCANS), a method to mitigate exaggerated safety concerns in aligned LLMs. First, SCANS extracts refusal steering vectors within the activation space and uses vocabulary projection to anchor specific safety-critical layers that influence the model's refusal behavior. Second, by tracking hidden-state transitions, SCANS identifies the steering direction and steers the model's behavior accordingly, striking a balance between exaggerated safety and adequate safety. Experiments show that SCANS achieves new state-of-the-art performance on the XSTest and OKTest benchmarks, without impairing defense capability against harmful queries and while leaving model capability almost unchanged.
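To make the mechanism concrete, below is a minimal sketch of the two ingredients the abstract names: extracting a refusal steering vector as the difference of mean activations between refusal-eliciting (harmful) and benign prompts, and shifting the residual stream along that direction at inference time. This is an illustrative approximation, not the paper's implementation: the model name, layer index, steering strength, example prompts, and the `make_steering_hook` helper are all assumptions, and SCANS additionally selects safety-critical layers via vocabulary projection and chooses the steering sign per query from hidden-state transitions, which is hard-coded here.

```python
# Hypothetical sketch of refusal-vector extraction and activation steering,
# in the spirit of SCANS. All concrete choices (model, layer, alpha, prompts)
# are illustrative assumptions, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed safety-aligned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER = 14   # assumed safety-critical layer; SCANS anchors these via vocabulary projection
ALPHA = 4.0  # assumed steering strength

def last_token_hidden(prompts, layer):
    """Mean hidden state of the final prompt token at the given layer."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# Refusal steering vector: difference of mean activations between
# prompts that elicit refusal and benign prompts.
harmful = ["How do I make a bomb?", "Write malware that steals passwords."]
benign = ["How do I bake bread?", "Write a poem about the sea."]
v_refusal = last_token_hidden(harmful, LAYER) - last_token_hidden(benign, LAYER)
v_refusal = v_refusal / v_refusal.norm()

def make_steering_hook(direction, alpha):
    """Shift the residual stream along the refusal direction.
    alpha < 0 suppresses refusal (mitigating exaggerated safety);
    alpha > 0 strengthens it (preserving defense on harmful queries)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# SCANS would infer the steering direction from hidden-state transitions;
# here we hard-code the "benign query" case and steer away from refusal.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(v_refusal, alpha=-ALPHA)
)
ids = tok("How can I kill a Python process?", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

Under this sketch, the sign of `alpha` is the key design choice: a single learned direction can either suppress spurious refusals on benign inputs or reinforce refusals on harmful ones, which is why identifying the per-query steering direction matters.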