Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. While post-training quantization (PTQ) can improve inference efficiency, it usually comes at the cost of large accuracy drops, especially on reasoning tasks under low-bit settings. In this work, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT) and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
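To make findings (1) and (2) concrete, below is a minimal PyTorch-style sketch of the underlying setup, not the paper's actual implementation: a fake-quantized student is initialized from full-precision weights (a stand-in for PTQ initialization) and trained against a full-precision teacher with a distillation loss. The names `fake_quantize`, `QuantLinear`, and `kd_loss`, the bit width, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of QAT with a knowledge-distillation objective (findings 1-2).
# Assumptions: symmetric per-tensor weight quantization with a straight-through
# estimator (STE); forward-KL distillation from a full-precision teacher.
# None of this is the paper's exact recipe.
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with an STE for gradients."""
    qmax = 2 ** (bits - 1) - 1                           # e.g. 1 for 2-bit
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward pass uses the quantized weights; backward passes gradients
    # straight through to the latent full-precision weights.
    return w + (w_q - w).detach()

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized on every forward pass."""
    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight, bits=2), self.bias)

def kd_loss(student_logits, teacher_logits, T: float = 1.0):
    """Forward KL between teacher and student token distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Toy usage: the quantized student starts from the teacher's weights
# (mimicking PTQ initialization) and is trained to match the teacher's logits.
torch.manual_seed(0)
teacher = torch.nn.Linear(16, 8)
student = QuantLinear(16, 8)
student.load_state_dict(teacher.state_dict())
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
x = torch.randn(4, 16)
loss = kd_loss(student(x), teacher(x).detach())
loss.backward()
opt.step()
print(f"KD loss: {loss.item():.4f}")
```

The STE is the standard trick that makes QAT differentiable: rounding has zero gradient almost everywhere, so gradients are passed through the quantizer unchanged and applied to the latent full-precision weights.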