We introduce Refusal Steering, an inference-time method for fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant for computing steering vectors that better isolates the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the model's refusal behaviour on politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in the deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
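The ridge-regularized steering-vector computation described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: `lam` (the ridge penalty), `alpha` (the steering strength), and the shape of the judge's confidence scores are all hypothetical choices; the paper's judge, layer selection, and scaling may differ.

```python
import numpy as np

def ridge_steering_vector(acts, scores, lam=10.0):
    """Fit a refusal direction by ridge regression.

    acts   : (n, d) hidden activations at one layer, one row per prompt
    scores : (n,) refusal-confidence labels from an LLM judge (assumed in [0, 1])
    lam    : ridge penalty; regularizes away directions only weakly
             correlated with the refusal signal (value is illustrative)
    """
    X = acts - acts.mean(axis=0)            # center activations
    y = scores - scores.mean()              # center labels
    d = X.shape[1]
    # Closed-form ridge solution: v = (X^T X + lam * I)^{-1} X^T y
    v = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return v / np.linalg.norm(v)            # unit-norm steering direction

def steer(hidden, v, alpha=8.0):
    """Inference-time intervention: subtract the refusal direction
    (scaled by a hypothetical strength alpha) to suppress refusal;
    adding it instead would induce targeted refusals."""
    return hidden - alpha * v
```

A usage sketch: collect activations for a contrast set of prompts, have the judge score each response for refusal, fit `v` at a deep layer (where the abstract reports the signal concentrates), and apply `steer` to that layer's hidden states during generation.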