Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive for its hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA's NVFP4 in accuracy, which has limited its adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall quantization error by increasing the effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX's hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).
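To make the power-of-two block scaling mentioned above concrete, the following is a minimal sketch of baseline MXFP4 quantization for a single block, assuming the conventions of the OCP MX specification (32 elements per block, a shared power-of-two scale, FP4 E2M1 elements with a largest magnitude of 6.0). It is not the OAS or MBS algorithm, whose details the abstract does not give. The function name `mxfp4_quantize_block` and the nearest-grid-point rounding are illustrative assumptions; real hardware typically rounds ties to even and clamps the scale exponent to the E8M0 range.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (the sign bit is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32   # number of elements that share one scale in MX formats


def mxfp4_quantize_block(block):
    """Hypothetical helper: quantize one block, return (dequantized values, shared exponent)."""
    block = np.asarray(block, dtype=np.float64)
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block), 0

    # frexp gives amax = m * 2**e with m in [0.5, 1), so floor(log2(amax)) == e - 1.
    _, e = np.frexp(amax)
    # Power-of-two shared scale: align the block maximum's binade with the top FP4 binade.
    # (Simplification: the exponent is not clamped to the E8M0 range here.)
    shared_exp = int(e) - 1 - FP4_EMAX
    scale = 2.0 ** shared_exp

    scaled = np.abs(block) / scale
    # Scaled magnitudes can reach just under 8.0, but FP4 tops out at 6.0: saturate.
    clipped = np.minimum(scaled, FP4_GRID[-1])
    # Simplification: pick the nearest grid point (ties go to the lower value).
    idx = np.argmin(np.abs(clipped[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale, shared_exp


if __name__ == "__main__":
    blk = np.zeros(BLOCK_SIZE)
    blk[:3] = [7.0, 1.0, -3.0]
    deq, exp = mxfp4_quantize_block(blk)
    print("shared exponent:", exp)
    print("first three dequantized values:", deq[:3])
```

In this sketch, a block whose largest magnitude is 7.0 receives a shared exponent of 0, so the 7.0 saturates to 6.0 while 1.0 and -3.0 are represented exactly. Saturation of this kind is a plausible example of the overflow that a power-of-two scale cannot avoid without giving up a full binade of resolution; whether this is exactly the case OAS targets is an assumption here, not something the abstract states.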