Fine-tuning safety-aligned language models for downstream tasks often substantially degrades refusal behavior, leaving the models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within the model's activation space, how these representations change during fine-tuning and why alignment degrades remain poorly understood. In this work, we investigate the representation-level mechanisms underlying alignment degradation. Our analysis shows that standard fine-tuning induces systematic drift in safety-relevant representations, distorts their geometric structure, and introduces interference between task optimization and safety features. Together, these effects lead to increased harmful compliance. Motivated by these findings, we introduce REFUSALGUARD, a representation-level fine-tuning framework that preserves safety-relevant structure during model adaptation. Our approach constrains updates in hidden representation space so that safety-mediating components remain stable while task-specific learning proceeds in complementary directions. We evaluate REFUSALGUARD across multiple model families, including LLaMA, Gemma, and Qwen, on adversarial safety benchmarks such as AdvBench, DirectHarm4, and JailbreakBench, as well as on downstream utility tasks. REFUSALGUARD achieves attack success rates comparable to those of the base safety-aligned models while maintaining competitive task performance, and it significantly outperforms the baselines.
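To make the idea of constraining updates to directions complementary to the safety-relevant ones more concrete, the following is a minimal sketch of one way such a constraint could look: a hidden-state update is projected onto the orthogonal complement of a small "safety subspace". This is an illustrative assumption, not the REFUSALGUARD procedure itself, which the abstract does not specify. The names `orthonormal_basis`, `project_out`, `safety_dirs`, and `raw_update` are hypothetical, and the safety directions here are random placeholders rather than directions extracted from a real model.

```python
import numpy as np


def orthonormal_basis(directions):
    """Return an orthonormal basis (as rows) spanning the given direction vectors.

    Assumes the input directions are linearly independent.
    """
    # Reduced QR on the transpose: the columns of q span the same space
    # as the rows of `directions`.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q.T


def project_out(update, safety_basis):
    """Remove the component of `update` that lies in the safety subspace.

    update:       array of shape (..., hidden_dim)
    safety_basis: array of shape (k, hidden_dim) with orthonormal rows
    """
    coeffs = update @ safety_basis.T        # coordinates in the safety subspace
    return update - coeffs @ safety_basis   # keep only the complementary part


# Hypothetical setup: random vectors stand in for safety-relevant directions.
rng = np.random.default_rng(0)
hidden_dim = 16
safety_dirs = rng.normal(size=(2, hidden_dim))
basis = orthonormal_basis(safety_dirs)

# A batch of raw hidden-state updates, as task optimization might produce.
raw_update = rng.normal(size=(4, hidden_dim))
constrained = project_out(raw_update, basis)

# Largest remaining component along any safety direction (should be ~0).
leakage = np.abs(constrained @ basis.T).max()
```

In this sketch, the constrained update leaves the coordinates along the safety directions unchanged, while the remaining `hidden_dim - 2` dimensions stay free for task-specific learning. A hard projection is only one option; a soft penalty on the same coordinates would be another way to express a comparable constraint.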