On-Policy Distillation (OPD) trains a student model on its own generated trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal is itself brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and it concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to 28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional 9.54% relative gain on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
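Two of the ingredients named in the abstract, peak-entropy selection of audit positions and Dirichlet-Multinomial smoothing of Monte Carlo counts, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the top-k selection rule, the symmetric prior with `alpha = 1.0`, and the reading of `counts` as "how many teacher rollouts matched each candidate chunk" are all assumptions made for illustration. The only parts taken as given are the standard formulas for Shannon entropy and for the posterior mean of a multinomial under a Dirichlet prior.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(entropies: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-entropy positions, returned in sequence order.

    Assumption: "peak-entropy scheduling" is approximated here by plain top-k.
    """
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def dirichlet_posterior_mean(counts: Sequence[int], alpha: float = 1.0) -> List[float]:
    """Posterior mean of multinomial probabilities under a symmetric Dirichlet(alpha) prior.

    Assumption: counts[i] is the number of Monte Carlo rollouts assigned to candidate i.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


if __name__ == "__main__":
    # Hypothetical per-position student entropies along one trajectory.
    entropies = [0.10, 1.20, 0.30, 0.90, 0.05]
    print(peak_entropy_positions(entropies, k=2))

    # Hypothetical rollout counts over three candidate chunks at one audited position.
    print(dirichlet_posterior_mean([3, 0, 1]))
```

In this sketch the prior keeps a candidate with zero observed rollouts at a small nonzero probability, which is one way a Bayesian prior can damp the variance of a small discrete sample. How OmniOPD actually combines these quantities with the semantic similarity metric and the KL anchor is not specified in the abstract.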