Quantization has emerged as a standard technique for accelerating inference in generative models, enabling faster low-precision computation and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this choice of scale can be suboptimal with respect to quantization error. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: a fine-grained search that leverages the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post-Training Quantization (PTQ) and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm that uses ScaleSearch and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points on MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 perplexity (PPL) by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while improving quantization accuracy.
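To make the contrast between max-based scaling and a searched scale concrete, the following is a minimal sketch and not the ScaleSearch algorithm itself, whose exact search procedure is not given here. It assumes an NVFP4-like setting (blocks of 16 values, FP4 E2M1 elements) and simplifies by keeping the scale in full precision instead of restricting it to the values the scale format can represent. The function names, the candidate range (`low`, `high`), and the number of candidates are hypothetical choices made for illustration.

```python
import random

# Non-negative magnitudes representable in FP4 (E2M1), the element format of NVFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize_block(block, scale):
    """Divide by the scale, round to the nearest FP4 magnitude, then rescale."""
    if scale == 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_MAX)  # saturate beyond the FP4 range
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        out.append((q if x >= 0 else -q) * scale)
    return out


def mse(block, scale):
    """Mean squared error between a block and its quantized reconstruction."""
    deq = quantize_block(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)


def max_scale(block):
    """Baseline: map the block's largest magnitude onto the largest FP4 value."""
    return max(abs(x) for x in block) / FP4_MAX


def search_scale(block, num_candidates=17, low=0.5, high=1.25):
    """Hypothetical search: try multiples of the baseline scale and keep the
    candidate with the lowest MSE. The baseline itself is always a candidate."""
    base = max_scale(block)
    candidates = [base]
    for i in range(num_candidates):
        factor = low + (high - low) * i / (num_candidates - 1)
        candidates.append(base * factor)
    return min(candidates, key=lambda s: mse(block, s))
```

Because the baseline scale is always among the candidates, `search_scale(block)` returns a scale whose MSE is never larger than that of `max_scale(block)`. How much it improves depends on the values in the block.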