Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM into smaller student LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, which addresses the mismatch between training and inference distributions found in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage the ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm in which a single LLM acts as both teacher and student under different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these two distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance than off-policy distillation methods. Code repo: https://github.com/siyan-zhao/OPSD.
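The training objective described above, a per-token divergence between a privileged-context teacher and a question-only student, evaluated on the student's own rollout, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `log_softmax`, `per_token_kl`, and `opsd_loss` are hypothetical, and the choice of the forward KL divergence KL(teacher || student) is an assumption, since the abstract does not say which divergence is used. The sketch takes precomputed logits as plain lists; in an actual setup both sets of logits would presumably come from the same model scored on the same student-sampled tokens, once with the privileged trace in the prompt and once without.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    Assumption: forward KL is used here purely for illustration; the
    abstract only says "per-token divergence".
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def opsd_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token divergence over the positions of one student rollout.

    teacher_logits_seq[i]: logits at position i given question + privileged trace.
    student_logits_seq[i]: logits at position i given the question only.
    Both are scored on the same student-sampled tokens (on-policy).
    """
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must score the same rollout")
    kls = [per_token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Toy rollout of 2 tokens over a 3-word vocabulary (made-up numbers).
    teacher = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
    student = [[1.0, 1.0, 0.5], [0.5, 1.5, 0.4]]
    print(f"toy loss: {opsd_loss(teacher, student):.4f}")
```

Because the teacher and student share weights in this setting, a real training loop would likely treat the teacher logits as a fixed target (no gradient) so that only the question-only pass is updated; that detail is also an assumption and is not shown in the sketch.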