Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily target the challenging sub-3-bit regime, where they often suffer severe accuracy degradation and typically require fine-tuning to remain competitive. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges of quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that improves residual-matrix quantization by applying block-wise permutation and Walsh-Hadamard transforms to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on a rank-1 sketch (R1SVD) to further reduce quantization cost. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning-based baselines. Specifically, LoPRo attains state-of-the-art quantization accuracy on the LLaMA-2 and LLaMA-3 model families while delivering up to a 4$\times$ speedup. On the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours while reducing perplexity by 0.4 and improving accuracy by 8\%. Moreover, compared with other low-rank quantization methods, LoPRo achieves superior accuracy at a significantly lower rank while maintaining high inference efficiency and minimal additional latency.
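To make the block-wise permutation and rotation step concrete, the following is a minimal PyTorch sketch of the general idea: columns are sorted by an importance score, grouped into blocks of similar importance, and each non-salient block is rotated by an orthonormal Walsh-Hadamard matrix, while the most salient block(s) are left unrotated. The function names, the block size, the `keep_blocks` parameter, and the choice of importance score (e.g., an activation-based Hessian-diagonal proxy) are illustrative assumptions; the paper's exact construction may differ.

```python
import torch

def walsh_hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of the Walsh-Hadamard matrix (n must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # orthonormal scaling, so H @ H.T = I

def rotate_residual(W: torch.Tensor, col_importance: torch.Tensor,
                    block: int = 64, keep_blocks: int = 1):
    """Illustrative sketch: permute columns so similarly important ones share a
    block, then rotate each block with a Hadamard matrix, leaving the
    `keep_blocks` most salient column blocks untouched.

    Assumes W.shape[1] is a multiple of `block`.
    """
    perm = torch.argsort(col_importance, descending=True)
    Wp = W[:, perm]
    H = walsh_hadamard(block)
    out = Wp.clone()
    n_blocks = Wp.shape[1] // block
    for b in range(keep_blocks, n_blocks):  # skip the most salient block(s)
        s = b * block
        out[:, s:s + block] = Wp[:, s:s + block] @ H
    return out, perm  # quantize `out`; invert H and perm at inference time
```

Because both the permutation and the block-diagonal Hadamard rotation are orthogonal, they can be undone exactly at inference (the permutation as an index select, the rotation as a fused multiply by $H^\top$), which is consistent with the low additional latency claimed above.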
翻译:训练后量化(PTQ)能够实现有效的模型压缩,同时保持相对较高的精度。当前的仅权重PTQ方法主要关注具有挑战性的亚3比特量化场景,其中方法通常遭受显著的精度损失,通常需要微调才能达到有竞争力的性能。在本工作中,我们重新审视了权重量化的基本特性,并分析了在低秩近似下量化残差矩阵所面临的挑战。我们提出了LoPRo,一种新颖的无微调PTQ算法,该算法通过应用分块置换和Walsh-Hadamard变换来旋转重要性相似的列,同时显式地保持最显著列块的分块量化精度,从而增强残差矩阵的量化。此外,我们引入了一种基于秩-1草图(R1SVD)的混合精度快速低秩分解方法,以进一步最小化量化成本。实验表明,LoPRo在2比特和3比特量化下均优于现有的无微调PTQ方法,达到了与微调基线相当的精度。具体而言,LoPRo在LLaMA-2和LLaMA-3系列模型上实现了最先进的量化精度,同时带来了高达4$\times$的加速。在MoE模型Mixtral-8x7B中,LoPRo在2.5小时内完成量化,同时将困惑度降低了0.4$\downarrow$,并将准确率提高了8\%$\uparrow$。此外,与其他低秩量化方法相比,LoPRo以显著更低的秩实现了更优的精度,同时保持了高推理效率和最小的额外延迟。
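The abstract does not spell out the R1SVD procedure, so the following is only a hedged sketch of one plausible reading of a "rank-1 sketch" low-rank decomposition: repeatedly extract the leading singular triplet of the residual via power iteration and deflate. The function name `rank1_sketch_lowrank`, the iteration count, and the deflation scheme are assumptions for illustration, not the paper's algorithm; mixed-precision handling of the factors is likewise omitted.

```python
import torch

def rank1_sketch_lowrank(W: torch.Tensor, rank: int, iters: int = 8):
    """Hypothetical sketch: build a rank-`rank` approximation of W by
    repeated rank-1 power-iteration sketches with deflation."""
    R = W.clone().float()
    Us, Ss, Vs = [], [], []
    for _ in range(rank):
        v = torch.randn(R.shape[1])
        v /= v.norm()
        for _ in range(iters):            # power iteration on R^T R
            u = R @ v
            u /= u.norm() + 1e-12
            v = R.T @ u
            v /= v.norm() + 1e-12
        s = (R @ v).norm()                # leading singular value of the residual
        u = R @ v / (s + 1e-12)
        R -= s * torch.outer(u, v)        # deflate before the next rank-1 pass
        Us.append(u); Ss.append(s); Vs.append(v)
    U = torch.stack(Us, dim=1)            # (m, rank)
    S = torch.stack(Ss)                   # (rank,)
    V = torch.stack(Vs, dim=1)            # (n, rank)
    return U, S, V                        # W ≈ U @ diag(S) @ V.T

# Usage sketch: keep the low-rank factors in higher precision and
# quantize only the residual W - U @ diag(S) @ V.T at low bit-width.
```

Compared with a full SVD, such a sketch touches the matrix only through matrix-vector products, which is one way a decomposition of this kind could keep quantization cost low at small ranks.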