From Refusal Geometry to Safety Geometry: Harmfulness--Refusal Coupling under Dynamic Adversarial Fine-Tuning

Safety alignment requires language models to refuse harmful requests without losing the ability to answer benign ones. Existing robustness evaluations, however, do not reveal whether a model has learned to recognize harmfulness, to activate a refusal policy, or to couple these two processes. We study this question with a dual safety-geometry protocol that measures harmfulness carriers, refusal carriers, and their coupling across aligned instruction-tuned anchors and matched Mistral-7B-v0.1 SFT/R2D2 training trajectories. The aligned anchors validate the protocol: refusal-side interventions reopen attack success more strongly than harmfulness-only interventions, while harmfulness and refusal carriers remain nearly orthogonal. Along the Mistral trajectory, R2D2 exhibits a high-coupling early phase with strong fixed-source robustness, saturated safe-prompt refusal, and collapsed benign utility. Later checkpoints move to a lower-coupling regime with partial utility recovery and reopened attack success. SFT provides an important contrast: it also reaches low coupling, but remains substantially less robust, showing that low coupling alone is not a safety guarantee. All-anchor diagnostics and sparse GCG/AutoDAN transfer experiments further show that H/R coupling is informative in the R2D2 regime, whereas SFT transfer is better summarized by drift or behavior-state measures. Causal sweeps support fixed-protocol sensitivity relative to matched unit-direction controls, but do not establish independent harmfulness and refusal pathways. These results frame harmfulness--refusal coupling as an operational diagnostic for safety-geometry dynamics under adversarial fine-tuning.
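As a rough illustration of what "coupling" between harmfulness and refusal carriers could mean operationally, the sketch below estimates each carrier as a difference-of-means direction in activation space and reports their cosine similarity, where values near zero correspond to the "nearly orthogonal" case. The abstract does not specify the estimator, so the difference-of-means construction, the cosine-based coupling score, the projection-style ablation, and all function names here are assumptions made for illustration, not the paper's protocol.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Return v scaled to unit L2 norm."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm


def diff_of_means_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Hypothetical carrier estimate: unit vector from the mean of `neg`
    activations to the mean of `pos` activations (rows are examples)."""
    return unit(pos.mean(axis=0) - neg.mean(axis=0))


def coupling(h_dir: np.ndarray, r_dir: np.ndarray) -> float:
    """Hypothetical coupling score: cosine similarity of the two carriers.
    Near 0 means nearly orthogonal; near +/-1 means strongly aligned."""
    return float(np.dot(unit(h_dir), unit(r_dir)))


def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from each activation row,
    a simple stand-in for a single-direction intervention."""
    d = unit(direction)
    return acts - np.outer(acts @ d, d)


# Toy activations (made up): harmfulness varies along axis 0,
# refusal varies along axis 1, so the two carriers are orthogonal.
harmful = np.array([[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
benign = np.zeros((2, 3))
refused = np.array([[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]])
complied = np.zeros((2, 3))

h_dir = diff_of_means_direction(harmful, benign)
r_dir = diff_of_means_direction(refused, complied)
hr_coupling = coupling(h_dir, r_dir)
```

In this toy setup the score is zero by construction. With real model activations, the same quantity would be computed per layer and per checkpoint, and tracking it across a training trajectory is one way to read the "high-coupling" and "lower-coupling" regimes described above.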
