Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitudes across channels and narrow the dynamic range within each quantization group, effectively addressing the outlier issue. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. Under weight-only quantization, ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks, with less than 10% overhead. ParoQuant also matches the accuracy of state-of-the-art weight-activation quantization methods. This paves the way for more efficient and accurate deployment of reasoning LLMs.
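To illustrate the core idea, the sketch below shows how a single Givens rotation on a pair of channels can even out their magnitudes and shrink the peak value that a quantizer must cover. This is a minimal toy example, not the ParoQuant algorithm itself: the function name, the toy weight matrix, and the fixed 45-degree angle are illustrative assumptions (the paper optimizes the rotation angles and combines many independent pairwise rotations with channel-wise scaling).

```python
import numpy as np

def givens_rotation(w, i, j, theta):
    """Rotate channel pair (i, j) of weight matrix w by angle theta.

    A Givens rotation is orthogonal, so it is exactly invertible and
    preserves the Frobenius norm of the rotated rows.
    """
    c, s = np.cos(theta), np.sin(theta)
    wi, wj = w[i].copy(), w[j].copy()
    w[i] = c * wi - s * wj
    w[j] = s * wi + c * wj
    return w

# Toy weights: channel 0 is an "outlier" channel with large magnitudes,
# channel 1 has small magnitudes -- a wide dynamic range for the quantizer.
w = np.array([[8.0, 9.0, 7.5],
              [0.1, -0.2, 0.15]])

# A 45-degree rotation mixes the two channels, spreading the outlier
# magnitude across both and lowering the peak absolute value.
w_rot = givens_rotation(w.copy(), 0, 1, np.pi / 4)

print("peak |w| before:", np.abs(w).max())
print("peak |w| after: ", np.abs(w_rot).max())
```

Because the rotation is orthogonal, it can be undone (or fused with other operations) at inference time, which is what keeps the runtime cost of such transforms low; each Givens rotation touches only two channels, so independent pairs can be applied in parallel on a GPU.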