ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose \textbf{ReSET}, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to $\sim\!$2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\!\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\!\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
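To make the idea of step-aware temperature scaling concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the update rule, so everything here is an assumption made for illustration: the class name `StepAwareTemperature`, the thresholds, the three temperature values, and the use of a running mean of token entropies as the step-level signal. The sketch lowers the temperature at low-entropy tokens, where incorrect sampling is most harmful, and raises it when the current reasoning step is uncertain overall, to counteract over-concentration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepAwareTemperature:
    """Hypothetical controller; thresholds and temperatures are illustrative."""

    def __init__(self, base_t=0.6, low_t=0.3, high_t=0.9,
                 token_thresh=0.1, step_thresh=1.0):
        self.base_t = base_t              # default decoding temperature
        self.low_t = low_t                # used at low-entropy tokens
        self.high_t = high_t              # used in high-uncertainty steps
        self.token_thresh = token_thresh  # token-entropy cutoff (assumed)
        self.step_thresh = step_thresh    # step-entropy cutoff (assumed)
        self._sum = 0.0
        self._count = 0

    def start_step(self):
        """Reset the running statistics at a reasoning-step boundary."""
        self._sum = 0.0
        self._count = 0

    def update(self, logits):
        """Return the temperature to use for sampling the current token."""
        token_h = entropy(softmax(logits))
        self._sum += token_h
        self._count += 1
        step_h = self._sum / self._count  # online step-level estimate
        if token_h < self.token_thresh:
            return self.low_t             # sharpen near-deterministic tokens
        if step_h > self.step_thresh:
            return self.high_t            # flatten in uncertain steps
        return self.base_t


controller = StepAwareTemperature()
controller.start_step()
temperature = controller.update([2.0, 0.0, 0.0, 0.0])
probs = softmax([2.0, 0.0, 0.0, 0.0], temperature)
```

How a step boundary is detected (for example, a newline token in the reasoning trace) is left to the caller in this sketch; the paper's actual criterion may differ.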
