On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm in which a teacher conditioned on privileged information provides dense token-level supervision. Prior work has shown that OPSD can collapse on verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting in which to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short, overly conservative responses, and the reverse KL objective further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety--helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
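The claim that reverse KL amplifies a contracted teacher distribution rests on the well-known mode-seeking behavior of that objective. The following is a minimal toy sketch of that general property, not the paper's experiment: the four-token `teacher`, `narrow`, and `broad` distributions are hypothetical values chosen only for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.49, 0.01, 0.01, 0.49]  # two equally likely modes
narrow = [0.97, 0.01, 0.01, 0.01]   # student committed to a single mode
broad = [0.25, 0.25, 0.25, 0.25]    # student spread over every token

# Reverse KL, KL(student || teacher): the expectation is taken under the student.
reverse_narrow = kl(narrow, teacher)
reverse_broad = kl(broad, teacher)

# Forward KL, KL(teacher || student): the expectation is taken under the teacher.
forward_narrow = kl(teacher, narrow)
forward_broad = kl(teacher, broad)

print(f"reverse KL  narrow={reverse_narrow:.3f}  broad={reverse_broad:.3f}")
print(f"forward KL  narrow={forward_narrow:.3f}  broad={forward_broad:.3f}")
```

In this toy setting the reverse KL is lower for the narrow student, while the forward KL is lower for the broad one. A student trained with reverse KL is therefore rewarded for collapsing onto a subset of what the teacher supports, which is consistent with the contraction described above.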