Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical trade-offs between refusal and over-refusal, so that each direction acts as the same shared one-dimensional control knob. The choice of direction primarily affects not whether the model refuses, but how it refuses.
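To make the terms "direction", "steering", and "ablation" concrete, the following is a minimal NumPy sketch on synthetic activations. It is illustrative only. The abstract does not say how directions are extracted, so the difference-of-means estimator here is an assumption, and the function names (`diff_of_means_direction`, `ablate`, `steer`), the dimensionality, and the synthetic data are all hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def diff_of_means_direction(refused, complied):
    """Assumed estimator: unit vector from the mean 'complied' activation
    to the mean 'refused' activation. Inputs have shape (n_samples, dim)."""
    return unit(refused.mean(axis=0) - complied.mean(axis=0))


def cosine(u, v):
    """Cosine similarity, used to check whether two directions are distinct."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ablate(acts, direction):
    """Directional ablation: remove each activation's component along direction."""
    d = unit(direction)
    return acts - np.outer(acts @ d, d)


def steer(acts, direction, alpha):
    """Linear steering: shift every activation by alpha along direction."""
    return acts + alpha * unit(direction)


# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 64
complied = rng.normal(size=(200, dim))

# Two refusal categories whose mean shifts lie along different axes.
offset_a = np.zeros(dim)
offset_a[0] = 4.0
offset_b = np.zeros(dim)
offset_b[1] = 4.0
refused_a = rng.normal(size=(200, dim)) + offset_a
refused_b = rng.normal(size=(200, dim)) + offset_b

dir_a = diff_of_means_direction(refused_a, complied)
dir_b = diff_of_means_direction(refused_b, complied)

# Low cosine similarity means the two category directions are geometrically distinct.
similarity = cosine(dir_a, dir_b)

# Ablation zeroes the projection onto dir_a; steering shifts it by alpha.
ablated = ablate(complied, dir_a)
steered = steer(complied, dir_a, alpha=3.0)
```

In a real model these operations would be applied to hidden states at a chosen layer during the forward pass, and the refusal and over-refusal rates would be measured from the generated outputs rather than from projections.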