On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
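As a rough illustration of what "teacher top-K local support matching" could look like at a single token position, the sketch below computes a reverse KL restricted to the teacher's K most likely tokens and skips positions whose sampled token is a special token. It is not the authors' implementation. Renormalizing both distributions over the top-K support, averaging over unmasked positions, and the names `topk_reverse_kl` and `masked_sequence_loss` are all assumptions made for this example. Top-p rollout sampling is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL, KL(student || teacher), on the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the truncated
    support, so the result is a proper, non-negative KL divergence.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    # Indices of the k tokens the teacher considers most likely.
    support = sorted(range(len(t_logp)), key=lambda i: t_logp[i], reverse=True)[:k]
    # Renormalize each distribution over that support.
    t_sub = log_softmax([t_logp[i] for i in support])
    s_sub = log_softmax([s_logp[i] for i in support])
    return sum(math.exp(s) * (s - t) for s, t in zip(s_sub, t_sub))


def masked_sequence_loss(student_seq, teacher_seq, sampled_tokens, special_ids, k):
    """Average the per-position loss, skipping special-token positions.

    Assumption: a position is masked when the token the student sampled
    there belongs to `special_ids` (a hypothetical set of token ids).
    """
    losses = [
        topk_reverse_kl(s, t, k)
        for s, t, tok in zip(student_seq, teacher_seq, sampled_tokens)
        if tok not in special_ids
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

Unlike a sampled-token signal, which scores only the one token the student happened to draw, this per-position quantity compares the two distributions across K candidate tokens at once.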