With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing the weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) -- a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices in place of the conventional full weight-scaling matrices, which entail as many learnable scales as their associated weights. Thanks to the parameter sharing enabled by the low-rank structure, LRQ learns significantly fewer parameters while still scaling each weight individually, thereby boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) $8$-bit weight and per-tensor activation quantization, (ii) $4$-bit weight and $8$-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at \url{https://github.com/onliwad101/FlexRound_LRQ} to inspire LLM researchers and engineers.
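To illustrate the core idea, the following is a minimal NumPy sketch of quantizing a weight matrix with a per-weight scale built from a low-rank product. This is an illustrative toy, not the paper's exact formulation: the factor matrices `A` and `B`, the `exp` positivity trick, and the symmetric 4-bit grid are all assumptions chosen for clarity, and in LRQ the low-rank factors would be learned by block-output reconstruction rather than randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4   # toy sizes; real LLM layers are far larger

W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Hypothetical low-rank parameterization of the weight-scaling matrix:
# instead of learning a full (d_out x d_in) scale S with one scale per
# weight, learn A (d_out x rank) and B (rank x d_in) and set
# S = exp(A @ B), so every weight still receives its own positive scale
# while only rank * (d_out + d_in) parameters are trained.
A = 0.01 * rng.standard_normal((d_out, rank)).astype(np.float32)
B = 0.01 * rng.standard_normal((rank, d_in)).astype(np.float32)
S = np.exp(A @ B)               # elementwise-positive, rank-limited scales

step = np.abs(W).max() / 7      # 4-bit symmetric grid: integers in [-7, 7]
W_int = np.clip(np.round(W / (step * S)), -7, 7)
W_hat = W_int * step * S        # dequantized weights used at inference

full_params = d_out * d_in              # scales in a full scaling matrix
lowrank_params = rank * (d_out + d_in)  # scales actually learned here
print(lowrank_params, full_params)      # 512 vs 4096 learnable scales
```

The parameter count of the low-rank factors grows linearly in `d_out + d_in` rather than with their product, which is what lets the method keep per-weight scaling while learning far fewer parameters.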