Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" computed from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry. Across five models (the Qwen3-VL and Ministral series), SRA reduces refusal rates to 0-2% with negligible perplexity impact on Wikitext-2 (mean ΔPPL ≈ 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show that SRA preserves the model's math and code distributions. These results suggest that common "model damage" is often "Ghost Noise": spectral bleeding of the dirty refusal direction into capability subspaces.
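The pipeline can be made concrete. Below is a minimal NumPy sketch of the two core steps, the contrastive refusal vector and the ridge-regularized spectral residualization, plus standard directional ablation; the function names, the `atoms` registry matrix, and the ridge weight `lam` are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def contrastive_refusal_vector(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """Raw refusal direction: difference of mean residual-stream
    activations over harmful vs. harmless prompts (each [n, d])."""
    return h_harmful.mean(axis=0) - h_harmless.mean(axis=0)

def surgical_residualize(r: np.ndarray, atoms: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Ridge-regularized spectral residualization (sketch).

    atoms: [k, d] registry of Concept Atoms (protected-capability and
    style directions). We ridge-regress r onto the atoms' singular
    basis and subtract the explained component, leaving a refusal
    direction approximately orthogonal to the protected span.
    """
    _, s, Vt = np.linalg.svd(atoms, full_matrices=False)  # Vt: [k, d]
    proj = Vt @ r                     # coordinates of r in the singular basis
    shrink = s**2 / (s**2 + lam)      # ridge shrinkage per singular direction
    r_clean = r - Vt.T @ (shrink * proj)
    return r_clean / np.linalg.norm(r_clean)

def ablate_direction(h: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Standard directional ablation: remove the component of the
    activations h along the (unit-norm) cleaned refusal direction."""
    return h - np.outer(h @ r_hat, r_hat) if h.ndim == 2 else h - (h @ r_hat) * r_hat
```

The ridge term shrinks weak singular directions of the atom registry instead of projecting them out exactly, which is what distinguishes this sketch from plain Gram-Schmidt orthogonalization against each atom.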
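The drift and capability metrics quoted above (first-token KL divergence and teacher-forced perplexity) can likewise be sketched. The following assumes Hugging Face-style causal LMs whose forward pass returns `.logits`, and omits details such as prompt-token masking and chat templating.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def first_token_kl(model_base, model_ablated, input_ids: torch.Tensor) -> float:
    """KL(base || ablated) over the first generated token's distribution,
    averaged across a batch of prompts; a cheap drift diagnostic."""
    logp_base = F.log_softmax(model_base(input_ids).logits[:, -1, :], dim=-1)
    logp_abl = F.log_softmax(model_ablated(input_ids).logits[:, -1, :], dim=-1)
    return F.kl_div(logp_abl, logp_base, reduction="batchmean", log_target=True).item()

@torch.no_grad()
def teacher_forced_ppl(model, input_ids: torch.Tensor) -> float:
    """Teacher-forced perplexity of a reference continuation (e.g. GSM8K
    or MBPP solutions); prompt-token masking is omitted for brevity."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return nll.exp().item()
```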