Activation steering is a practical post-training alignment technique for enhancing the utility of Large Language Models (LLMs). Before deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without retraining: the intervention is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon we term Steering Externalities, in which steering vectors derived from entirely benign datasets, such as those enforcing strict compliance or specific output formats like JSON, inadvertently erode safety guardrails. Our experiments reveal that these interventions act as a force multiplier, creating new jailbreak vulnerabilities and raising attack success rates to over 80% on standard benchmarks by bypassing the model's initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and demonstrating that inference-time utility improvements must be rigorously audited for unintended safety externalities.
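To make the steering mechanism concrete, the following is a minimal sketch of adding a steering vector to a model's internal representations via a forward hook. It assumes a HuggingFace-style causal LM; the model name ("gpt2"), intervention layer, steering strength, and the random placeholder vector are all illustrative assumptions rather than the setup described above (in practice the vector would be derived from a benign dataset, e.g., as a difference of mean activations between examples with and without the target behavior).

```python
# Minimal activation-steering sketch (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with a .transformer.h block list works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6   # hypothetical intervention layer
alpha = 4.0     # hypothetical steering strength

# Placeholder steering vector; a real one would be estimated from benign data,
# e.g., mean activation difference between "compliant" and neutral prompts.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden). Add the scaled vector at every position.
    hidden_states = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "Respond strictly in JSON:"
ids = tokenizer(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unsteered model
```

Note that the hook changes no weights: the base model is untouched, and the behavioral shift, including any erosion of safety guardrails, comes entirely from the inference-time addition.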