Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference that afflicts off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage the ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework in which a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x higher token efficiency than reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
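The training objective described above can be illustrated with a minimal sketch. The function and array names below (`opsd_loss`, `student_logits`, `teacher_logits`) are hypothetical, the divergence direction (forward KL per rollout token) is an assumption, and the example uses plain NumPy on toy logits rather than any actual LLM:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def opsd_loss(student_logits, teacher_logits, eps=1e-12):
    """Mean per-token KL divergence over a single student rollout.

    Hypothetical sketch of the OPSD objective:
    - student_logits: [T, V] logits from the model conditioned on the
      question only (the student context).
    - teacher_logits: [T, V] logits from the SAME model conditioned on the
      question plus privileged information, e.g. a verified reasoning
      trace (the teacher context).
    Both are evaluated at the same T positions of a rollout sampled from
    the student policy (on-policy). The divergence direction is an
    assumption; the paper's exact choice may differ.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Per-token KL(student || teacher), averaged over rollout positions.
    kl = (p_student * (np.log(p_student + eps) - np.log(p_teacher + eps))).sum(axis=-1)
    return kl.mean()

# Toy example: T=4 rollout tokens, vocabulary of size V=6.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 6))
teacher = rng.normal(size=(4, 6))

loss = opsd_loss(student, teacher)
```

When the two contexts induce identical distributions the loss vanishes, so gradient updates only flow where the privileged trace actually changes the model's token predictions.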