Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime, where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as the loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck before gradually recovering. We further uncover a token-level mechanism: token confidence bifurcates into Imitation-Anchor Tokens, whose confidence rises steadily and quickly anchors optimization, and yet-to-learn tokens, whose confidence is suppressed until after the bottleneck. The inability of these two token types to be optimized simultaneously is the root cause of failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both autoregressive (AR) and diffusion LLM (dLLM) settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among 16B-scale no-think models.
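To make the token-selection idea concrete, here is a minimal sketch of a trajectory-aware masked distillation loss, assuming a PyTorch student and a simple exponential-moving-average confidence tracker. The threshold `anchor_conf`, the decay `ema_decay`, and all names below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def t3s_style_loss(logits, targets, conf_ema, anchor_conf=0.9, ema_decay=0.9):
    """Cross-entropy restricted to tokens whose smoothed confidence is still low.

    logits:   (batch, seq_len, vocab) student predictions
    targets:  (batch, seq_len) distillation target token ids
    conf_ema: (batch, seq_len) running per-token confidence from earlier steps
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Probability the student assigns to each distillation target token.
    tok_conf = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).exp()
    # Smooth confidence across the training trajectory (hence "trajectory-aware").
    conf_ema = ema_decay * conf_ema + (1.0 - ema_decay) * tok_conf.detach()
    # Treat persistently high-confidence tokens as already-anchored imitation
    # anchors and drop them from the objective, leaving yet-to-learn tokens.
    keep = conf_ema < anchor_conf
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    loss = (nll * keep).sum() / keep.sum().clamp(min=1)
    return loss, conf_ema
```

In this sketch, the caller would persist `conf_ema` per training example across epochs so that the mask reflects each token's confidence history rather than a single step; that accumulated history is what distinguishes trajectory-aware selection from one-shot confidence filtering.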