As LLMs scale, low-bit floating-point formats such as MXFP and NVFP4 offer new opportunities for balancing precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparisons across weight-activation and KV-cache quantization tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel on high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat offers a complete solution for high-efficiency LLM inference on NPUs.