Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector-steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off -- reducing jailbreaks increases over-refusals, and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector $v_a$) and the judgment of input safety (benign vector $v_b$) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns $v_a$ with $v_b$ through closed-form weight updates, making the model's willingness to answer causally dependent on its safety assessment -- without fine-tuning or architectural changes. Our method identifies these vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns the vectors via minimum-norm weight modifications. Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM-VA-Web/.
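The alignment step can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's actual procedure: the hidden states are toy data, a class-mean difference stands in for the paper's SVM-derived direction, and the alignment constraint (make the layer respond to $v_b$ as it did to $v_a$) is one plausible reading of "aligning $v_a$ with $v_b$". The minimum-norm rank-one update itself is the standard closed form for satisfying a single linear constraint with the smallest Frobenius-norm change.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden-state dimension

# Toy hidden states for "answer" vs. "refuse" prompts (stand-in data).
H_answer = rng.normal(0.0, 1.0, (50, d)) + 2.0
H_refuse = rng.normal(0.0, 1.0, (50, d)) - 2.0

def direction(pos, neg):
    """Class-mean difference, a stand-in for the paper's SVM normal vector."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v_a = direction(H_answer, H_refuse)          # answer vector (illustrative)
v_b = rng.normal(size=d)                     # benign vector (illustrative)
v_b = v_b / np.linalg.norm(v_b)

W = rng.normal(0.0, 0.1, (d, d))             # toy layer weight matrix

# Hypothetical constraint: after the update, v_b should produce the output
# that v_a produced before, coupling "answer" to "judged benign".
target = W @ v_a

# Minimum-norm rank-one update: the smallest ||dW||_F with W_new @ v_b == target.
dW = np.outer(target - W @ v_b, v_b) / (v_b @ v_b)
W_new = W + dW

# The constraint holds exactly after the closed-form update.
assert np.allclose(W_new @ v_b, target)
```

In an iterative version, one would re-extract $v_a$ from the updated model and repeat until the two directions are sufficiently aligned, restricting updates to the safety-relevant layers selected earlier.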