Instruction-following language models are trained to be helpful and safe, yet their safety behavior can erode under benign fine-tuning and degrade further under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization strength in response to estimated safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns batch-level harm scores, and an activation-based risk predictor, a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach yields a risk signal that constrains updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent is predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimator consistently lowers the attack success rate relative to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
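To make the core mechanism concrete, the sketch below shows one way risk-adaptive regularization could look in a fine-tuning loop: a per-batch risk estimate scales a penalty that anchors the model to a frozen safe reference. This is an illustrative sketch, not the paper's implementation; the `risk_score` callable (standing in for either the Safety Critic judge or the activation-based classifier), the `lam_max` coefficient, and the choice of an L2 penalty toward reference weights (rather than, e.g., a KL term on output distributions) are all assumptions for exposition.

```python
# Minimal sketch of risk-adaptive regularization during fine-tuning.
# Assumed: a PyTorch / Hugging Face-style causal LM whose forward pass
# returns an object with a .loss attribute, a `risk_score(batch)` callable
# in [0, 1] (hypothetical stand-in for the paper's Safety Critic or
# activation-based risk predictor), and L2 distance to frozen reference
# weights as the "stay close to the safe policy" term.
import torch


def fine_tune_step(model, ref_params, batch, optimizer, risk_score, lam_max=10.0):
    """One update: task loss plus a risk-scaled pull toward the safe reference."""
    risk = risk_score(batch)          # estimated harm risk for this batch, in [0, 1]
    lam = lam_max * risk              # higher risk -> stronger anchoring penalty

    outputs = model(**batch)          # standard language-modeling loss
    loss = outputs.loss

    if lam > 0:                       # low-risk batches train without regularization
        reg = sum((p - p_ref).pow(2).sum()
                  for p, p_ref in zip(model.parameters(), ref_params))
        loss = loss + lam * reg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), risk


# Frozen snapshot of the aligned model, taken once before fine-tuning begins:
# ref_params = [p.detach().clone() for p in model.parameters()]
```

Under this reading, a benign batch (risk near 0) reduces to standard fine-tuning, while a batch flagged as harmful (risk near 1) is pulled strongly toward the aligned reference, which is consistent with the abstract's claim of preserving utility while limiting unsafe drift.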