Large Reasoning Models (LRMs) achieve strong problem-solving performance through long chain-of-thought reasoning, but their deployment is constrained by the high cost of full-precision inference and by growing KV cache footprints. Microscaled FP4 formats make 4-bit deployment efficient; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation, and existing post-training quantization (PTQ) and quantization-aware training (QAT) methods fail to recover the lost accuracy. We find that FP4 failures concentrate on low-entropy tokens (precise symbolic commitments such as digits and operators), where quantization noise inflates sampling errors that then cascade through the reasoning trace. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy, while delivering up to 3.9x throughput speedup on NVIDIA DGX Spark and up to 3.1x on B200.
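To make the source of quantization noise concrete, the following is a minimal sketch of simulated ("fake") block-scaled FP4 quantization, not the ReQAT implementation. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a shared power-of-two scale per block. The function names, the default block size of 32, and the rule for choosing the scale are illustrative assumptions; hardware formats encode and select the scale in their own ways.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign bit is handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_E2M1_GRID[-1]


def fake_quantize_block(block):
    """Quantize, then dequantize, one block of floats that share a single scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumption: smallest power-of-two scale that lets the largest magnitude
    # fit inside the grid, so nothing is clipped.
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_MAX))
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_MAX)
        # Round to the nearest grid point (ties go to the smaller magnitude).
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(values, block_size=32):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize_block(values[start:start + block_size]))
    return out


if __name__ == "__main__":
    original = [6.0, 3.0, -1.5, 0.4]
    print(fake_quantize(original, block_size=4))
```

Because the grid is coarse near its top end (the gap between 4 and 6 is as large as the gap between 0 and 2), a single large value in a block forces small values onto few distinct levels. Applied to W4A4KV4, this rounding error enters weights, activations, and cached keys and values alike.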
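The idea of acting only on low-entropy positions can be sketched as follows. The abstract does not give SEM's exact objective, so this is a hypothetical stand-in: it computes the entropy of each position's next-token distribution, selects positions below a threshold, and returns their mean entropy as a quantity a trainer could minimize. The function names and the fixed `threshold` parameter are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_entropy_loss(logit_rows, threshold):
    """Mean entropy over positions whose entropy is below `threshold`.

    Returns (loss, selected_indices). Positions at or above the threshold
    are left out, so only already-confident decisions are sharpened.
    """
    entropies = [entropy(softmax(row)) for row in logit_rows]
    selected = [i for i, h in enumerate(entropies) if h < threshold]
    if not selected:
        return 0.0, selected
    loss = sum(entropies[i] for i in selected) / len(selected)
    return loss, selected


if __name__ == "__main__":
    rows = [
        [8.0, 0.0, 0.0, 0.0],  # confident position, e.g. a digit
        [1.0, 1.0, 1.0, 1.0],  # uncertain position, uniform over 4 tokens
    ]
    loss, selected = selective_entropy_loss(rows, threshold=0.5)
    print(selected, loss)
```

In an actual training loop this term would be computed on tensors with gradients and combined with the main QAT loss; how the threshold is chosen and how the term is weighted are left open here.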