Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily target the challenging sub-3-bit regime, where they often suffer severe accuracy degradation and typically require fine-tuning to remain competitive. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges of quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that improves residual-matrix quantization by applying block-wise permutation and Walsh-Hadamard transforms to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on a rank-1 sketch (R1SVD) to further reduce quantization cost. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning-based baselines. Specifically, LoPRo attains state-of-the-art quantization accuracy on the LLaMA-2 and LLaMA-3 model families while delivering up to a 4$\times$ speedup. On the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours while reducing perplexity by 0.4 and improving accuracy by 8\%. Moreover, compared with other low-rank quantization methods, LoPRo achieves superior accuracy at a significantly lower rank while maintaining high inference efficiency and minimal additional latency.
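To make the block-wise permutation and rotation step concrete, the following is a minimal PyTorch sketch of the general idea: columns are sorted by an importance score, grouped into blocks of similar importance, and each non-salient block is rotated by an orthonormal Walsh-Hadamard matrix, while the most salient block(s) are left unrotated. The function names, the block size, the `keep_blocks` parameter, and the choice of importance score (e.g., an activation-based Hessian-diagonal proxy) are illustrative assumptions; the paper's exact construction may differ.

```python
import torch

def walsh_hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of the Walsh-Hadamard matrix (n must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # orthonormal scaling, so H @ H.T = I

def rotate_residual(W: torch.Tensor, col_importance: torch.Tensor,
                    block: int = 64, keep_blocks: int = 1):
    """Illustrative sketch: permute columns so similarly important ones share a
    block, then rotate each block with a Hadamard matrix, leaving the
    `keep_blocks` most salient column blocks untouched.

    Assumes W.shape[1] is a multiple of `block`.
    """
    perm = torch.argsort(col_importance, descending=True)
    Wp = W[:, perm]
    H = walsh_hadamard(block)
    out = Wp.clone()
    n_blocks = Wp.shape[1] // block
    for b in range(keep_blocks, n_blocks):  # skip the most salient block(s)
        s = b * block
        out[:, s:s + block] = Wp[:, s:s + block] @ H
    return out, perm  # quantize `out`; invert H and perm at inference time
```

Because both the permutation and the block-diagonal Hadamard rotation are orthogonal, they can be undone exactly at inference (the permutation as an index select, the rotation as a fused multiply by $H^\top$), which is consistent with the low additional latency claimed above.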
翻译:训练后量化(PTQ)能够实现有效的模型压缩,同时保持相对较高的精度。当前的仅权重PTQ方法主要关注具有挑战性的亚3比特量化场景,其中方法通常遭受显著的精度损失,通常需要微调才能达到有竞争力的性能。在本工作中,我们重新审视了权重量化的基本特性,并分析了在低秩近似下量化残差矩阵所面临的挑战。我们提出了LoPRo,一种新颖的无微调PTQ算法,该算法通过应用分块置换和Walsh-Hadamard变换来旋转重要性相似的列,同时显式地保持最显著列块的分块量化精度,从而增强残差矩阵的量化。此外,我们引入了一种基于秩-1草图(R1SVD)的混合精度快速低秩分解方法,以进一步最小化量化成本。实验表明,LoPRo在2比特和3比特量化下均优于现有的无微调PTQ方法,达到了与微调基线相当的精度。具体而言,LoPRo在LLaMA-2和LLaMA-3系列模型上实现了最先进的量化精度,同时带来了高达4$\times$的加速。在MoE模型Mixtral-8x7B中,LoPRo在2.5小时内完成量化,同时将困惑度降低了0.4$\downarrow$,并将准确率提高了8\%$\uparrow$。此外,与其他低秩量化方法相比,LoPRo以显著更低的秩实现了更优的精度,同时保持了高推理效率和最小的额外延迟。
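The abstract does not spell out the R1SVD procedure, so the following is only a hedged sketch of one plausible reading of a "rank-1 sketch" low-rank decomposition: repeatedly extract the leading singular triplet of the residual via power iteration and deflate. The function name `rank1_sketch_lowrank`, the iteration count, and the deflation scheme are assumptions for illustration, not the paper's algorithm; mixed-precision handling of the factors is likewise omitted.

```python
import torch

def rank1_sketch_lowrank(W: torch.Tensor, rank: int, iters: int = 8):
    """Hypothetical sketch: build a rank-`rank` approximation of W by
    repeated rank-1 power-iteration sketches with deflation."""
    R = W.clone().float()
    Us, Ss, Vs = [], [], []
    for _ in range(rank):
        v = torch.randn(R.shape[1])
        v /= v.norm()
        for _ in range(iters):            # power iteration on R^T R
            u = R @ v
            u /= u.norm() + 1e-12
            v = R.T @ u
            v /= v.norm() + 1e-12
        s = (R @ v).norm()                # leading singular value of the residual
        u = R @ v / (s + 1e-12)
        R -= s * torch.outer(u, v)        # deflate before the next rank-1 pass
        Us.append(u); Ss.append(s); Vs.append(v)
    U = torch.stack(Us, dim=1)            # (m, rank)
    S = torch.stack(Ss)                   # (rank,)
    V = torch.stack(Vs, dim=1)            # (n, rank)
    return U, S, V                        # W ≈ U @ diag(S) @ V.T

# Usage sketch: keep the low-rank factors in higher precision and
# quantize only the residual W - U @ diag(S) @ V.T at low bit-width.
```

Compared with a full SVD, such a sketch touches the matrix only through matrix-vector products, which is one way a decomposition of this kind could keep quantization cost low at small ranks.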