Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace across 13 popular open-source chat models of up to 72B parameters. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal even on harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
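To make the two interventions concrete, the sketch below shows, under the assumption of a PyTorch model with a `d_model`-dimensional residual stream, how a single direction could be erased from (or added to) activations. The function names `ablate_direction` and `add_direction`, and the scaling parameter `alpha`, are illustrative and not from the paper; how the refusal direction itself is extracted is not specified in the abstract.

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation vector along `direction`.

    activations: (..., d_model) residual-stream activations
    direction:   (d_model,) candidate refusal direction
    """
    r_hat = direction / direction.norm()
    # Project each activation onto the unit direction and subtract that component,
    # leaving activations orthogonal to r_hat.
    coeff = activations @ r_hat                  # shape (...,)
    return activations - coeff.unsqueeze(-1) * r_hat

def add_direction(activations: torch.Tensor, direction: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Shift activations along `direction` by a factor `alpha` to induce refusal."""
    r_hat = direction / direction.norm()
    return activations + alpha * r_hat
```

In practice such functions would be applied inside forward hooks on the residual stream at one or more layers; the hook mechanics and the choice of layers are implementation details beyond what the abstract states.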