Large language models (LLMs) are omnipresent, but their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute- and memory-efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance, but this comes at the cost of potentially long training time and excessive memory usage, making it impractical for LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, incurring no additional overhead compared to traditional post-training quantization (PTQ); (ii) can be seen as a general extended pretraining framework, meaning the resulting model can still be used for any downstream task afterwards; and (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity or activation quantization, and can be seamlessly combined with many PTQ techniques. We apply LR-QAT to the LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common PTQ approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory.
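To make component (a) concrete, the following is a minimal NumPy sketch of the core idea of quantization-grid-aware low-rank auxiliary weights: the trainable low-rank product `A @ B` is added to the (frozen, downcast) pretrained weights *inside* the round-and-clamp quantizer, so that after training it can be fused into a plain integer weight tensor with no LoRA-style inference overhead. All names (`lr_qat_weight`, `alpha`) and the exact placement of the scale are illustrative assumptions, not the paper's definitive formulation.

```python
import numpy as np

def lr_qat_weight(W0_frozen, A, B, scale, bits=4, alpha=1.0):
    """Hedged sketch of an LR-QAT-style weight quantizer.

    W0_frozen : frozen pretrained weights, already divided by `scale` and
                stored in a low-memory format (e.g. fixed point / INT-b).
    A, B      : trainable low-rank auxiliary matrices (rank r << min(dims)).
    The low-rank correction lives on the integer quantization grid, inside
    the round/clip, so the trained result fuses into ordinary INT weights.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # add the low-rank update on the quantization grid, then quantize
    W_int = np.clip(np.round(W0_frozen + alpha * (A @ B)), qmin, qmax)
    # dequantized weights used in the forward pass during training
    return scale * W_int

# Illustrative usage: with A, B at zero this reduces to plain rounding.
W0 = np.array([[1.2, -2.7], [0.4, 3.9]])
A, B = np.zeros((2, 1)), np.zeros((1, 2))
W_hat = lr_qat_weight(W0, A, B, scale=0.5, bits=4)
```

In a real training loop the rounding would be paired with a straight-through estimator so gradients reach `A` and `B`; that detail is omitted here for brevity.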