Post-training quantization (PTQ) plays a key role in compressing large language models (LLMs) at very low cost. However, existing PTQ methods handle outliers only within a single layer or a single block. This ignores the dependencies between blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ models cross-block dependency with a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy to suppress weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. Together, these components enable CBQ both to handle extreme outliers effectively and to improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior results in low-bit settings (W4A4, W4A8, W2A16, where WxAy denotes x-bit weights and y-bit activations) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model in only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
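To make the notion of a cross-block reconstruction error concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a symmetric per-tensor uniform quantizer, toy blocks made of a dense layer followed by `tanh`, and overlapping windows of consecutive blocks. The abstract does not specify the windowing scheme, so `block_windows` and its `overlap` parameter are purely illustrative, as are all function names. The sketch does not implement the homologous reconstruction scheme, CFP, or LoRA-Rounding. It only shows how an error can be measured over several consecutive blocks rather than one block at a time.

```python
import math
import random


def fake_quantize(weights, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Illustrative quantizer (an assumption, not CBQ's rounding scheme).
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a symmetric signed grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Round to the nearest grid point, then clamp to the signed integer range.
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def quantize_block(matrix, bits):
    """Quantize one block's weight matrix (a list of rows) with a shared scale."""
    cols = len(matrix[0])
    flat = fake_quantize([v for row in matrix for v in row], bits)
    return [flat[i * cols:(i + 1) * cols] for i in range(len(matrix))]


def run_blocks(x, blocks):
    """Apply toy blocks in sequence: each block is a dense layer plus tanh."""
    for matrix in blocks:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in matrix]
    return x


def block_windows(num_blocks, window, overlap):
    """Return (start, end) index pairs of windows that share `overlap` blocks."""
    if num_blocks < 1 or window < 1 or not 0 <= overlap < window:
        raise ValueError("need num_blocks >= 1, window >= 1, 0 <= overlap < window")
    step = window - overlap
    windows, start = [], 0
    while True:
        end = min(start + window, num_blocks)
        windows.append((start, end))
        if end == num_blocks:
            return windows
        start += step


def window_error(x, blocks, bits, start, end):
    """Mean squared error between full-precision and quantized window outputs."""
    reference = run_blocks(x, blocks[start:end])
    quantized = run_blocks(x, [quantize_block(m, bits) for m in blocks[start:end]])
    return sum((a - b) ** 2 for a, b in zip(reference, quantized)) / len(reference)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, depth = 4, 4
    blocks = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(depth)]
    x = [rng.gauss(0, 1) for _ in range(dim)]
    for start, end in block_windows(depth, window=2, overlap=1):
        # For simplicity each window is fed the same input x; a real pipeline
        # would feed the activations produced by the preceding blocks.
        print((start, end), window_error(x, blocks, bits=4, start=start, end=end))
```

In this sketch a window of size 1 reduces to a per-block error, while a larger window also captures how quantization error in one block propagates into the next. This is the effect the abstract refers to as error accumulation.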