Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
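
The offline idea described above, caching teacher log-probabilities once over fixed rollouts and reusing them at every training step, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the loss, so the sketch assumes a per-token log-ratio objective (student log-probability minus cached teacher log-probability on each sampled token), which is one common choice for OPD. The names `precompute_teacher_logprobs`, `offline_opd_loss`, `toy_teacher`, and `toy_student` are hypothetical, and the toy models stand in for real language models.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def precompute_teacher_logprobs(teacher_logits_fn, rollouts):
    # Run once, offline: store the teacher log-prob of every rollout token.
    cache = []
    for tokens in rollouts:
        per_token = []
        for t, tok in enumerate(tokens):
            per_token.append(log_softmax(teacher_logits_fn(tokens[:t]))[tok])
        cache.append(per_token)
    return cache

def offline_opd_loss(student_logits_fn, rollouts, teacher_cache):
    # Assumed objective: mean of (student_logp - cached teacher_logp) over
    # all tokens of the fixed rollouts. No teacher call happens here.
    total, count = 0.0, 0
    for tokens, teacher_lps in zip(rollouts, teacher_cache):
        for t, tok in enumerate(tokens):
            student_lp = log_softmax(student_logits_fn(tokens[:t]))[tok]
            total += student_lp - teacher_lps[t]
            count += 1
    return total / count

# Hypothetical toy "models" over a 3-token vocabulary; logits depend only
# on the prefix length so the example stays self-contained.
def toy_teacher(prefix):
    return [1.0 + len(prefix), 0.5, -0.5]

def toy_student(prefix):
    return [0.2, 0.1 * len(prefix), 0.3]

# Fixed rollouts; in the setting described above these would be sampled
# from the SFT-initialized model.
rollouts = [[0, 1, 2], [2, 0]]
teacher_cache = precompute_teacher_logprobs(toy_teacher, rollouts)

loss_mismatch = offline_opd_loss(toy_student, rollouts, teacher_cache)
loss_match = offline_opd_loss(toy_teacher, rollouts, teacher_cache)
```

In this sketch the teacher is queried only inside `precompute_teacher_logprobs`; the training-time loss reads the cache, which is what removes the need for a live teacher server. When the student's distribution equals the teacher's, the log-ratio is zero on every token, so `loss_match` is zero.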