Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising way to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize their latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose \textbf{ReSET}, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to approximately 2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\!\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\!\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
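To make the idea of entropy-based temperature scaling concrete, the following is a minimal sketch and not the ReSET algorithm itself: the abstract does not state the actual update rule. It assumes that the temperature is lowered at low-entropy tokens (to reduce incorrect sampling) and raised when the running step-level entropy is high (to counter over-concentration). The names `adapt_temperature` and `StepEntropyTracker`, the running-mean estimator, and all thresholds and scale factors are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepEntropyTracker:
    """Online estimate of step-level uncertainty (assumption: a running mean
    of the token entropies seen so far in the current reasoning step)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, token_entropy):
        self.total += token_entropy
        self.count += 1

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def reset(self):
        # Call at a reasoning-step boundary (how a boundary is detected
        # is left unspecified here).
        self.total = 0.0
        self.count = 0


def adapt_temperature(token_entropy, step_entropy, base_t=1.0,
                      low_token=0.5, high_step=1.5, cool=0.7, warm=1.2):
    """Hypothetical rule: all thresholds and factors are illustrative.

    - Low token entropy: sharpen the distribution (lower temperature).
    - High step entropy: flatten the distribution (higher temperature).
    - Otherwise: keep the base temperature.
    """
    if token_entropy < low_token:
        return base_t * cool
    if step_entropy > high_step:
        return base_t * warm
    return base_t


# Usage: choose a temperature for one decoding position.
tracker = StepEntropyTracker()
logits = [2.0, 1.0, 0.5, 0.1]           # made-up logits for illustration
h_token = entropy(softmax(logits))      # token-level entropy at T = 1
tracker.update(h_token)
t = adapt_temperature(h_token, tracker.mean())
sampling_probs = softmax(logits, temperature=t)
```

In a real decoding loop the tracker would be updated at every generated token and reset at each reasoning-step boundary; the resulting `sampling_probs` would then feed the sampler in place of the fixed-temperature distribution.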