TIP: Token Importance in On-Policy Distillation

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy but high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses the second region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where training only on Q3 tokens (the low-entropy, high-divergence region), fewer than $20\%$ of all tokens, surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
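
To make the two-axis view concrete, the following is a minimal sketch of how tokens might be sorted by student entropy and teacher--student divergence, and how a type-aware mask could then keep only the two informative regions. It is an illustration under stated assumptions, not the paper's implementation: the function names, the region labels (`uncertain`, `overconfident`, `settled`), the fixed thresholds `h_thresh` and `d_thresh`, and the choice of KL(student || teacher) as the divergence are all hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def classify_tokens(student_probs, teacher_probs, h_thresh, d_thresh):
    """Label each token position by student entropy and divergence.

    Assumption: divergence is KL(student || teacher) and both thresholds
    are fixed constants; the labels are illustrative names only.
    """
    labels = []
    for s, t in zip(student_probs, teacher_probs):
        if entropy(s) >= h_thresh:
            labels.append("uncertain")        # high student entropy
        elif kl_divergence(s, t) >= d_thresh:
            labels.append("overconfident")    # low entropy, high divergence
        else:
            labels.append("settled")          # low entropy, low divergence
    return labels


def select_mask(labels):
    """Keep tokens from the two regions described as informative."""
    return [label in ("uncertain", "overconfident") for label in labels]


# Toy example: three token positions over a three-word vocabulary.
student = [
    [0.40, 0.30, 0.30],   # student is unsure
    [0.98, 0.01, 0.01],   # student is confident, teacher disagrees
    [0.98, 0.01, 0.01],   # student is confident, teacher agrees
]
teacher = [
    [0.50, 0.30, 0.20],
    [0.01, 0.98, 0.01],
    [0.97, 0.02, 0.01],
]

labels = classify_tokens(student, teacher, h_thresh=0.5, d_thresh=0.5)
mask = select_mask(labels)
```

In this toy example the second position has low entropy, so an entropy-only rule would drop it, yet its divergence from the teacher is large. The mask keeps it, which mirrors the abstract's point that entropy alone misses overconfident tokens. In practice the thresholds would more plausibly be set from quantiles of each batch than fixed by hand.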
