Context distillation enables language models to internalize in-context knowledge into their parameters. We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation and context distillation by training a student model on its self-generated trajectories while minimizing the reverse Kullback-Leibler (KL) divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two applications: experiential knowledge distillation, in which models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, in which models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, in which smaller student models internalize experiential knowledge from larger teachers.
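As a minimal sketch of the training objective (under assumed notation; the paper's exact formulation may differ), let $\pi_\theta$ denote the student, $x$ a task prompt drawn from a dataset $\mathcal{D}$, $c$ the context to be internalized (e.g., extracted experiential knowledge or an optimized system prompt), and $\pi_T(\cdot \mid c, x)$ the context-conditioned teacher. OPCD samples trajectories $y$ from the student itself and minimizes the per-token reverse KL to the teacher:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \Bigg[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\Big( \pi_\theta(\cdot \mid x, y_{<t}) \,\Big\|\, \pi_T(\cdot \mid c, x, y_{<t}) \Big) \Bigg].
$$

Because the expectation is taken over the student's own rollouts and the student occupies the first argument of the KL, the objective is on-policy and mode-seeking; the teacher sees the context $c$ while the student does not, so driving this loss down amounts to absorbing $c$ into the parameters $\theta$.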