Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage the ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework in which a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these two distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x greater token efficiency than reinforcement learning methods such as GRPO and outperforming off-policy distillation methods.
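The training signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a reverse-KL form of the per-token divergence (the exact divergence is not specified here), and uses toy logit vectors in place of real model outputs. The key point it captures is that teacher and student are the *same* model under different contexts, and both score the same student-sampled tokens:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single token's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The student distribution comes from the model conditioned on the
    question alone; the teacher distribution comes from the same model
    conditioned on the question plus privileged information (e.g. a
    verified reasoning trace). Both are evaluated on a token from the
    student's own rollout, so supervision is dense and on-policy.
    """
    p = softmax(student_logits)   # student: question-only context
    q = softmax(teacher_logits)   # teacher: question + privileged trace
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def opsd_loss(student_logits_seq, teacher_logits_seq):
    # Average the per-token divergence over the student's rollout.
    kls = [per_token_reverse_kl(s, t)
           for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(kls) / len(kls)
```

When the two contexts induce identical next-token distributions the loss is zero; any disagreement between the privileged and unprivileged views of the same model produces a positive per-token penalty that pulls the student toward the teacher.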