Post-training quantization (PTQ) plays a key role in compressing large language models (LLMs) at very low cost. However, existing PTQ methods handle outliers only within a single layer or a single block; they ignore the dependencies between blocks, which leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ models cross-block dependency through a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy that suppresses weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. Together, these components enable CBQ to handle extreme outliers effectively and to improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the LLAMA1-65B model to 4 bits in only 4.3 hours on a single GPU, achieving a favorable tradeoff between performance and quantization efficiency.
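To make the error-accumulation problem concrete, the following is a minimal sketch of a plain round-to-nearest baseline, not CBQ itself. It assumes a toy model in which each "block" is a small linear layer with a residual connection, and it quantizes weights only (in the WxAy notation, x is the weight bit-width and y the activation bit-width). The names `quantize_rtn`, `block_forward`, and `accumulated_errors` are hypothetical and chosen for illustration. Each block is quantized independently, and the output error against the full-precision model is recorded after every block.

```python
import random


def quantize_rtn(weights, bits):
    """Symmetric per-row round-to-nearest quantization.

    Returns the dequantized floats, so the rounding error can be inspected.
    This is a generic baseline, not the LoRA-Rounding scheme from the paper.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit weights
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                  # one scale per row (assumption)
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale
            for w in weights]


def block_forward(matrix, vec):
    """Toy 'block': linear layer plus residual connection (assumption)."""
    out = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    return [x + o for x, o in zip(vec, out)]


def accumulated_errors(blocks, x, bits):
    """Mean squared output error after each block, full precision vs. quantized.

    Every block is quantized in isolation, so errors made in earlier blocks
    are passed on to later ones without any correction.
    """
    fp, q = list(x), list(x)
    errors = []
    for matrix in blocks:
        q_matrix = [quantize_rtn(row, bits) for row in matrix]
        fp = block_forward(matrix, fp)
        q = block_forward(q_matrix, q)
        errors.append(sum((a - b) ** 2 for a, b in zip(fp, q)) / len(fp))
    return errors


# Usage: six random 8x8 blocks, weights quantized to 4 and to 2 bits.
random.seed(0)
toy_blocks = [[[random.gauss(0.0, 0.3) for _ in range(8)] for _ in range(8)]
              for _ in range(6)]
toy_input = [random.gauss(0.0, 1.0) for _ in range(8)]
for n_bits in (4, 2):
    print(n_bits, accumulated_errors(toy_blocks, toy_input, n_bits))
```

Printing the per-block errors shows how the mismatch with the full-precision output evolves with depth when blocks are treated independently. A cross-block reconstruction objective, as proposed here, would instead optimize quantization parameters over several consecutive blocks jointly; that optimization is deliberately left out of this sketch.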