As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety: it adds a refusal direction vector to the internal activations of LLMs during inference, thereby inducing refusal behavior. However, indiscriminately applying activation steering suffers from a fundamental trade-off between safety and utility, since the same steering vector can also cause over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero steering vector for benign data under null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for malicious data via linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising their general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.