On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, because the hinted model tends to produce shorter, more direct outputs. We term this pathology \emph{privilege-induced style drift}; it destabilizes training or causes response length to shrink. To address this, we propose \textbf{RLCSD} (Reinforcement Learning with Contrastive on-policy Self-Distillation), which contrasts the teacher-student gap under a correct hint against the gap under a wrong hint. The contrast suppresses the style shift that conditioning on a hint tends to induce regardless of correctness, yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
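The contrastive idea in the abstract can be illustrated with a toy sketch. This is not the paper's objective: it assumes the teacher-student gap is a per-token log-ratio on the sampled tokens, and the function names (`token_gap`, `contrastive_signal`) and all probabilities are hypothetical values chosen for illustration. Under this simplification the unhinted term cancels algebraically in the difference; it is kept explicit so the code mirrors the description of contrasting two gaps.

```python
import numpy as np

def token_gap(hinted_logp, student_logp):
    """Per-token log-ratio between a hinted pass and the unhinted pass.

    Assumption: the "teacher-student gap" is modeled as a log-ratio on the
    sampled tokens; the paper may define the gap differently.
    """
    return np.asarray(hinted_logp, dtype=float) - np.asarray(student_logp, dtype=float)

def contrastive_signal(student_logp, correct_hint_logp, wrong_hint_logp):
    """Gap under a correct hint minus gap under a wrong hint (illustrative)."""
    gap_correct = token_gap(correct_hint_logp, student_logp)
    gap_wrong = token_gap(wrong_hint_logp, student_logp)
    return gap_correct - gap_wrong

# Hypothetical probabilities for four sampled tokens.
# Tokens 0-1 play the role of style tokens: any hint shifts them identically.
# Tokens 2-3 play the role of task-bearing tokens: only the correct hint helps.
student = np.log([0.50, 0.40, 0.30, 0.20])
correct = np.log([0.25, 0.20, 0.60, 0.40])
wrong = np.log([0.25, 0.20, 0.15, 0.10])

plain_gap = token_gap(correct, student)               # non-contrastive gap
signal = contrastive_signal(student, correct, wrong)  # contrastive gap
print("plain gap:  ", np.round(plain_gap, 3))
print("contrastive:", np.round(signal, 3))
```

In this toy setup the plain gap is nonzero on the style tokens, while the contrastive signal is zero there and remains nonzero only on the task-bearing tokens, which is the qualitative behavior the abstract attributes to the contrast.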