Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, a cost known as the alignment tax. Existing methods mitigate this tax by balancing the two objectives, but they rely heavily on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. SafeSteer then restricts the reverse KL penalty to the selected tokens during training to preserve general capabilities. Experimental results across diverse models show that SafeSteer achieves a better trade-off between safety and general capability than existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples and no general-purpose data, less than 1% of the data used by previous baselines, considerably reducing alignment cost. More details are available on our project page at https://anjingkun.github.io/SafeSteer.
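The idea of restricting the reverse KL penalty to a subset of tokens can be illustrated with a minimal sketch. The abstract does not specify how safety tokens are selected, so the rule used here (keep the positions where the student diverges most from the teacher) is purely a hypothetical stand-in, and all function names and the `top_fraction` parameter are assumptions made for illustration, not SafeSteer's actual algorithm. Logits are plain Python lists, one list per token position.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL at one position: KL(student || teacher)."""
    log_p = log_softmax(student_logits)   # student distribution
    log_q = log_softmax(teacher_logits)   # (steered) teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def select_safety_tokens(per_token_kl, top_fraction=0.1):
    """HYPOTHETICAL selection rule (not from the paper): keep the positions
    with the largest student-teacher divergence."""
    k = max(1, math.ceil(top_fraction * len(per_token_kl)))
    ranked = sorted(range(len(per_token_kl)),
                    key=lambda i: per_token_kl[i], reverse=True)
    return set(ranked[:k])

def masked_reverse_kl_loss(student, teacher, top_fraction=0.1):
    """Average the reverse KL over the selected positions only; all other
    positions contribute nothing to the loss."""
    per_token = [reverse_kl(s, t) for s, t in zip(student, teacher)]
    selected = select_safety_tokens(per_token, top_fraction)
    loss = sum(per_token[i] for i in selected) / len(selected)
    return loss, selected

# Toy sequence: the teacher disagrees with the student only at position 2.
student = [[0.0, 1.0, 2.0] for _ in range(4)]
teacher = [row[:] for row in student]
teacher[2] = [2.0, 1.0, 0.0]
loss, selected = masked_reverse_kl_loss(student, teacher, top_fraction=0.25)
```

In an actual training loop the same masking would be applied to the per-token loss of an automatic-differentiation framework, so that gradients flow only through the student at the selected positions.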