On-policy distillation (OPD) trains a student model on the distribution induced by its own rollouts while drawing supervision from a stronger teacher. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo sudden length inflation, so that truncated trajectories come to dominate the training data. This truncation collapse coincides with an abrupt saturation in repetition and induces biased gradient signals, leading to severe training instability and a sharp drop in validation performance. We attribute the problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long, repetitive rollouts. To address it, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. Together, these two components mitigate repetition-induced length inflation and stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
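The two symptoms described above, truncated trajectories dominating the batch and repetition saturating, can be tracked with simple per-batch statistics. The following is a minimal sketch of such monitoring, not the metrics used in this work: the function names, the generation cap `max_len`, and the n-gram size `n` are all illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def truncation_rate(lengths: Sequence[int], max_len: int) -> float:
    """Fraction of rollouts whose length reached the generation cap.

    `max_len` is a hypothetical cap on rollout length; a rollout with
    length >= max_len is treated as truncated.
    """
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= max_len) / len(lengths)


def repetition_rate(tokens: Sequence[int], n: int = 4) -> float:
    """Fraction of n-grams in one rollout that duplicate an earlier n-gram.

    This is one common proxy for repetition; the choice of n is an
    assumption made for illustration.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / total
```

Logging both quantities per training step would show whether a rise in the truncated fraction coincides with a rise in repetition, which is the pattern the abstract describes.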