Large language models (LLMs) are omnipresent, but their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute- and memory-efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance, but this comes at the cost of potentially long training time and excessive memory usage, making it impractical for LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, incurring no additional overhead compared to traditional post-training quantization (PTQ); (ii) can be seen as a general extended pretraining framework, meaning the resulting model can still be used for any downstream task afterwards; and (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity or activation quantization, and can be seamlessly combined with many PTQ techniques. We apply LR-QAT to the LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common PTQ approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory.
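To make component (a) concrete, the following is a minimal NumPy sketch of the core idea of quantization-grid-aware low-rank auxiliary weights: the trainable low-rank product `A @ B` is added to the (frozen, downcast) pretrained weights *inside* the round-and-clamp quantizer, so that after training it can be fused into a plain integer weight tensor with no LoRA-style inference overhead. All names (`lr_qat_weight`, `alpha`) and the exact placement of the scale are illustrative assumptions, not the paper's definitive formulation.

```python
import numpy as np

def lr_qat_weight(W0_frozen, A, B, scale, bits=4, alpha=1.0):
    """Hedged sketch of an LR-QAT-style weight quantizer.

    W0_frozen : frozen pretrained weights, already divided by `scale` and
                stored in a low-memory format (e.g. fixed point / INT-b).
    A, B      : trainable low-rank auxiliary matrices (rank r << min(dims)).
    The low-rank correction lives on the integer quantization grid, inside
    the round/clip, so the trained result fuses into ordinary INT weights.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # add the low-rank update on the quantization grid, then quantize
    W_int = np.clip(np.round(W0_frozen + alpha * (A @ B)), qmin, qmax)
    # dequantized weights used in the forward pass during training
    return scale * W_int

# Illustrative usage: with A, B at zero this reduces to plain rounding.
W0 = np.array([[1.2, -2.7], [0.4, 3.9]])
A, B = np.zeros((2, 1)), np.zeros((1, 2))
W_hat = lr_qat_weight(W0, A, B, scale=0.5, bits=4)
```

In a real training loop the rounding would be paired with a straight-through estimator so gradients reach `A` and `B`; that detail is omitted here for brevity.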