On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
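The per-token objective described above can be illustrated with a minimal sketch. The abstract states only that forward KL is added to reverse KL when teacher entropy is high. The hard entropy threshold, the unweighted sum of the two terms, and the names `entropy_aware_token_loss`, `threshold`, and `EPS` are assumptions made for illustration, not the authors' formulation.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi + EPS) for pi in p)


def kl(a, b):
    """KL(a || b) for two discrete distributions over the same vocabulary."""
    return sum(ai * math.log((ai + EPS) / (bi + EPS)) for ai, bi in zip(a, b))


def entropy_aware_token_loss(teacher, student, threshold=1.0):
    """Reverse KL everywhere, plus forward KL where the teacher is uncertain.

    `threshold` is a hypothetical hyperparameter; the gating rule used in
    the paper may differ (for example, a soft weight instead of a hard gate).
    """
    reverse_kl = kl(student, teacher)       # mode-seeking term, KL(student || teacher)
    if entropy(teacher) > threshold:        # teacher is uncertain at this token
        forward_kl = kl(teacher, student)   # mode-covering term, KL(teacher || student)
        return reverse_kl + forward_kl
    return reverse_kl


# Toy distributions over a four-token vocabulary.
student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]   # low entropy: reverse KL only
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]   # high entropy: both terms

loss_confident = entropy_aware_token_loss(confident_teacher, student)
loss_uncertain = entropy_aware_token_loss(uncertain_teacher, student)
```

In on-policy training, a loss of this kind would be evaluated at each position of a trajectory sampled from the student, so the forward KL term only contributes at the positions where the teacher distribution is spread across several plausible tokens.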