Large Language Models (LLMs) excel in NLP, but their heavy computational demands hinder widespread deployment. Although Quantization-Aware Training (QAT) offers a solution, its extensive training cost makes Post-Training Quantization (PTQ) a more practical approach for LLMs. Existing studies identify activation outliers in particular channels as the bottleneck to PTQ accuracy. These studies propose transferring the magnitudes from activations to weights, which either offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low bitwidths. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly: channel disassembly first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes, and channel assembly then merges similar channels to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that learns only a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes LLaMA-2-70B to 4 bits within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% in average accuracy across five zero-shot tasks.
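To make the channel disassembly and assembly idea concrete, the following is a minimal NumPy sketch for a single linear layer computing `x @ W`. It is an illustration under stated assumptions, not the QLLM implementation: the function names `disassemble_channel` and `assemble_channels` are hypothetical, the equal split into `T` sub-channels and the average-then-sum merge rule are assumed here for simplicity, and the adaptive choice of the number of sub-channels is not modeled.

```python
import numpy as np


def disassemble_channel(x, W, k, T):
    """Split input channel k into T equal sub-channels (assumed equal split).

    x: (batch, M) activations; W: (M, N) weights; the layer computes x @ W.
    Each sub-channel carries x[:, k] / T and row k of W is repeated T times,
    so the layer output is unchanged up to floating-point rounding.
    """
    sub_x = np.repeat(x[:, k:k + 1] / T, T, axis=1)  # (batch, T)
    sub_W = np.repeat(W[k:k + 1, :], T, axis=0)      # (T, N)
    x_new = np.concatenate([x[:, :k], sub_x, x[:, k + 1:]], axis=1)
    W_new = np.concatenate([W[:k], sub_W, W[k + 1:]], axis=0)
    return x_new, W_new


def assemble_channels(x, W, i, j):
    """Merge channels i and j into one channel appended at the end.

    Assumed merge rule: average the two activations and sum the two weight
    rows. This is exact only when the two activation channels are equal;
    otherwise it is an approximation.
    """
    merged_x = (x[:, i] + x[:, j]) / 2
    merged_W = W[i] + W[j]
    keep = [c for c in range(x.shape[1]) if c not in (i, j)]
    x_new = np.concatenate([x[:, keep], merged_x[:, None]], axis=1)
    W_new = np.concatenate([W[keep], merged_W[None, :]], axis=0)
    return x_new, W_new


# Toy data: channel 2 is an outlier with magnitude 100.
x = np.arange(24, dtype=np.float64).reshape(4, 6) / 10.0
x[:, 2] = 100.0
W = np.random.default_rng(0).normal(size=(6, 3))

# Disassembly grows the channel count from 6 to 9 and shrinks the peak value.
x_d, W_d = disassemble_channel(x, W, k=2, T=4)

# Assembly of two identical sub-channels brings the count down by one.
x_a, W_a = assemble_channels(x_d, W_d, 2, 3)
```

In this toy case the largest activation magnitude drops from 100 to 25 after disassembly while `x_d @ W_d` still matches `x @ W`, which is the property that narrows the quantization range. Merging channels that are only similar, rather than identical, trades a small approximation error for the restored channel count.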
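The claim that the learned low-rank parameters can be fused into the frozen weights without affecting inference follows from linearity, and can be checked with a short NumPy sketch. The names `W_frozen`, `A`, and `B`, the shapes, and the additive form `W_frozen + A @ B` are assumptions made for illustration; how QLLM handles the interaction between the fused update and the quantized weight format is not described in the text above and is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 5, 2

W_frozen = rng.normal(size=(d_in, d_out))  # stands in for the frozen layer weights
A = rng.normal(size=(d_in, rank))          # learned low-rank factor (hypothetical name)
B = rng.normal(size=(rank, d_out))         # learned low-rank factor (hypothetical name)
x = rng.normal(size=(3, d_in))             # a small batch of inputs

# During tuning: frozen path plus a separate low-rank path.
y_two_path = x @ W_frozen + (x @ A) @ B

# After tuning: fold the low-rank product into a single weight matrix.
W_fused = W_frozen + A @ B
y_fused = x @ W_fused
```

The fused layer has the same shape as the original one, so it performs a single matrix multiplication at inference time, and the two outputs agree up to floating-point rounding.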