As large language models continue to grow, low-precision numerical formats such as NVFP4 have attracted increasing interest as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains difficult, as the reduced precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that reduces quantization error. Unlike integer formats, floating-point formats have non-uniform step sizes, which produce larger quantization errors for larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, yielding performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 than models trained with current state-of-the-art NVFP4 training recipes. Our code is available at http://github.com/mit-han-lab/fouroversix.
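To make the core idea concrete, the following is a toy NumPy sketch of adaptive block scaling, not the paper's GPU kernel: FP4 (E2M1) step sizes widen toward the top of the range, so mapping a block's maximum to 4 instead of 6 places near-maximal values in a denser region of the grid. The per-block error comparison used to choose between the two targets is an illustrative assumption.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; note the non-uniform spacing:
# steps widen from 0.5 (below 2) to 1.0 (2..4) to 2.0 (4..6).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, target_max):
    """Scale a block so its largest |value| maps to `target_max`,
    then round every element to the nearest FP4 grid point."""
    scale = np.abs(x).max() / target_max
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

def quantize_4over6(x):
    """Illustrative 4/6-style selection: quantize with the block max
    mapped to 6 (standard NVFP4) and to 4 (denser grid region),
    then keep whichever candidate has lower total error."""
    candidates = [quantize_block(x, m) for m in (6.0, 4.0)]
    errors = [np.abs(c - x).sum() for c in candidates]
    return candidates[int(np.argmin(errors))]
```

With the max mapped to 6, the step near the top of a block is (max/6)·2 = max/3; mapped to 4, it is (max/4)·1 = max/4, so near-maximal values are represented more finely, at the cost of a coarser effective grid elsewhere, which is why the choice is made adaptively per block.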