From Refusal Geometry to Safety Geometry: Harmfulness--Refusal Coupling under Dynamic Adversarial Fine-Tuning

Safety alignment requires language models to refuse harmful requests without losing the ability to answer benign ones. Existing robustness evaluations, however, do not reveal whether a model has learned to recognize harmfulness, to activate a refusal policy, or to couple these two processes. We study this question with a dual safety-geometry protocol that measures harmfulness carriers, refusal carriers, and their coupling across aligned instruction-tuned anchors and matched Mistral-7B-v0.1 SFT/R2D2 training trajectories. The aligned anchors validate the protocol: refusal-side interventions reopen attack success more strongly than harmfulness-only interventions, while harmfulness and refusal carriers remain nearly orthogonal. Along the Mistral trajectory, R2D2 exhibits a high-coupling early phase with strong fixed-source robustness, saturated safe-prompt refusal, and collapsed benign utility. Later checkpoints move to a lower-coupling regime with partial utility recovery and reopened attack success. SFT provides an important contrast: it also reaches low coupling, but remains substantially less robust, showing that low coupling alone is not a safety guarantee. All-anchor diagnostics and sparse GCG/AutoDAN transfer experiments further show that H/R coupling is informative in the R2D2 regime, whereas SFT transfer is better summarized by drift or behavior-state measures. Causal sweeps support fixed-protocol sensitivity relative to matched unit-direction controls, but do not establish independent harmfulness and refusal pathways. These results frame harmfulness--refusal coupling as an operational diagnostic for safety-geometry dynamics under adversarial fine-tuning.
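
As a minimal sketch of what a coupling measurement between harmfulness and refusal carriers could look like, the example below makes two assumptions that the abstract does not specify: each carrier is taken to be a difference-of-means direction in activation space, and coupling is taken to be the absolute cosine similarity between the two directions. All function names and the toy activations are hypothetical illustrations, not the paper's protocol.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos, neg):
    """Direction from the mean of `neg` activations to the mean of `pos`."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm direction")
    return dot / (norm_u * norm_v)


def coupling(harmful, harmless, refused, complied):
    """Assumed coupling score: |cos| between the two carrier directions.

    0 means the directions are orthogonal; 1 means they are collinear.
    """
    h_dir = difference_of_means(harmful, harmless)  # assumed harmfulness carrier
    r_dir = difference_of_means(refused, complied)  # assumed refusal carrier
    return abs(cosine(h_dir, r_dir))


# Toy 2-D activations (hypothetical): harmfulness varies along axis 0,
# refusal along axis 1, so the two directions are orthogonal.
harmful = [[1.0, 0.0], [3.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
refused = [[0.0, 1.0], [0.0, 3.0]]
complied = [[0.0, 0.0], [0.0, 0.0]]

low = coupling(harmful, harmless, refused, complied)
high = coupling(harmful, harmless, harmful, harmless)
```

Under these assumptions, a score near 0 corresponds to the nearly orthogonal carriers reported for the aligned anchors, and a score near 1 to a high-coupling regime. In practice the activations would come from a chosen layer and token position of the model, which this sketch leaves out.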
