The emergence of fine-grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, existing Post-Training Quantization (PTQ) strategies are difficult to adapt to these formats: rotation-based methods compromise fine-grained block isolation; smoothing techniques struggle with significant 4-bit quantization errors; and mixed-precision approaches often conflict with hardware constraints on unified-precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst-case error bound of our dual-stage NVFP4 quantization is comparable to that of standard 8-bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to a 3x speedup over FP16. Our code is available at https://github.com/actypedef/ARCQuant.
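To make the residual-channel idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: the error left by a first 4-bit quantization pass is itself quantized and appended to the activations as extra channels, while the weight matrix is duplicated along the reduction dimension, so a single unified-format GEMM performs both the original product and the error compensation. The `fake_quant` helper is an illustrative stand-in for NVFP4 block quantization (real NVFP4 uses E2M1 values with per-16-element FP8 scales), and all names here are hypothetical.

```python
import numpy as np

# Signed E2M1 (FP4) value grid, used only as a toy quantization target.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1 = np.concatenate([-E2M1[1:][::-1], E2M1])

def fake_quant(x):
    """Toy per-row-scaled FP4-style quantizer: snap to nearest E2M1 grid value."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 6.0 + 1e-12
    idx = np.abs(x[..., None] / scale[..., None] - E2M1).argmin(axis=-1)
    return E2M1[idx] * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))   # activations: tokens x channels
W = rng.normal(size=(64, 8))   # weights: channels x out_features

Xq = fake_quant(X)             # first-stage quantization of activations
Rq = fake_quant(X - Xq)        # second stage: quantize the residual error itself

# Augment along the reduction dimension: one standard GEMM then computes
# Xq @ W + Rq @ W, i.e. the main product plus the compensation term.
X_aug = np.concatenate([Xq, Rq], axis=1)   # (4, 128)
W_aug = np.concatenate([W, W], axis=0)     # (128, 8)

Y_ref = X @ W
print("plain 4-bit error:        ", np.abs(Xq @ W - Y_ref).max())
print("residual-augmented error: ", np.abs(X_aug @ W_aug - Y_ref).max())
```

Under this sketch's assumptions, the augmented GEMM recovers most of the accuracy lost by a single 4-bit pass while remaining a single unified-format matrix multiply.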