Context distillation enables language models to internalize in-context knowledge into their parameters. We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation and context distillation: the student model is trained on its own generated trajectories while minimizing the reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
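To make the objective concrete, the following is a minimal sketch of a per-token reverse KL loss on a student-sampled trajectory. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `reverse_kl`, `trajectory_loss`) are hypothetical, next-token logits are plain Python lists, the teacher logits are assumed to come from a model that sees the extra context (a system prompt or experiential knowledge) while the student does not, and the averaging over tokens is a choice made here for simplicity. Sampling and gradient computation are omitted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single next-token distribution.

    The expectation is taken under the student, which is what makes
    this the reverse (mode-seeking) direction.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def trajectory_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one trajectory.

    Assumption: both lists hold logits at the same positions of a
    trajectory sampled from the student; the teacher scored that same
    trajectory with the additional context in its input.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy usage: two token positions over a vocabulary of size 2.
student = [[0.0, 0.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 0.0]]
loss = trajectory_loss(student, teacher)
```

The loss is zero wherever the student already matches the context-conditioned teacher and positive elsewhere. Because the divergence is not symmetric, swapping the arguments gives the forward KL, which is a different objective.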