Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making models comply with harmful requests. Through extensive experiments across multiple model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 1-13%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, shows comparable harmful potential. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
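To make the mechanism concrete, below is a minimal sketch of activation steering with a random direction, loosely mirroring the random-steering setting described above. It is an illustrative implementation under assumptions, not the paper's exact setup: the model (`gpt2`), layer index, and steering scale are all placeholders.

```python
# Minimal activation-steering sketch: add a fixed vector to one layer's
# hidden states via a forward hook on a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any decoder-only HF model works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # which residual-stream layer to steer (assumption)
scale = 4.0     # steering strength (assumption)

# A random unit-norm direction, as in the random-steering experiments.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden_states = output[0] if isinstance(output, tuple) else output
    steered = hidden_states + scale * direction.to(hidden_states.dtype)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
ids = tok("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

The same pattern generalizes: an SAE feature direction or a sum of several sampled vectors can be substituted for `direction`, with the hook left unchanged.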