Activation steering has emerged as a powerful tool for shaping LLM behavior without weight updates. While its brittleness and unreliability are well documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as the benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (by up to 57%) or decrease (by up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior, which gives a traceable explanation for the effect. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs and highlight a trade-off between controllability and safety.
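To make the overlap argument concrete, the following is a minimal sketch of the quantities involved, not the evaluation code used in this work. It assumes the standard CAA construction (the steering vector is the difference between mean activations on positive and negative contrastive examples) and measures overlap as cosine similarity with a refusal direction. The synthetic activations, the dimension `d`, the shift of `2.0`, and the helper names are all illustrative assumptions.

```python
import numpy as np

def caa_steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between positive and negative examples."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(hidden, vector, alpha):
    """Add the steering vector, scaled by alpha, to a hidden state."""
    return hidden + alpha * vector

# Synthetic stand-ins for layer activations (assumption: no real model is used).
rng = np.random.default_rng(0)
d = 64
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

# Hypothetical behavior whose positive examples happen to be shifted along
# the refusal direction, so the resulting steering vector overlaps with it.
neg_acts = rng.normal(size=(200, d))
pos_acts = rng.normal(size=(200, d)) + 2.0 * refusal_dir

vector = caa_steering_vector(pos_acts, neg_acts)
overlap = cosine_similarity(vector, refusal_dir)

# Steering along or against the vector moves a hidden state's projection
# onto the refusal direction up or down, respectively.
hidden = rng.normal(size=d)
base_proj = float(np.dot(hidden, refusal_dir))
plus_proj = float(np.dot(steer(hidden, vector, 1.0), refusal_dir))
minus_proj = float(np.dot(steer(hidden, vector, -1.0), refusal_dir))

print(f"overlap with refusal direction: {overlap:.2f}")
print(f"refusal projection: base={base_proj:.2f}, +v={plus_proj:.2f}, -v={minus_proj:.2f}")
```

In this toy setting, a steering vector with positive overlap raises the refusal projection when added and lowers it when subtracted, which is the mechanism the abstract proposes for the observed changes in ASR.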