Large language models (LLMs) are central to modern natural language processing and artificial intelligence, but their substantial memory requirements pose a significant deployment challenge. Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, yet it is often impractical because of the extensive training resources it requires. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT comprises two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enlarging the solution space explored during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by accounting for interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, at scales from 7B to 70B parameters and various quantization bit-widths. For instance, EfficientQAT produces a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points of accuracy degradation relative to full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.
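To make the two phases concrete, the following is a minimal sketch (not the authors' released code; the function name and pure-Python form are illustrative assumptions) of the uniform fake-quantization operation at the heart of QAT. The step size and zero point are the quantization parameters: Block-AP would train them jointly with the weights block by block, while E2E-QP would freeze the weights and continue training only these parameters end-to-end.

```python
def fake_quantize(weights, step, zero_point, bits):
    """Quantize weights to a `bits`-bit uniform grid, then dequantize.

    `step` (the scale) and `zero_point` are the quantization parameters
    that E2E-QP would keep training after Block-AP; in a real QAT setup
    they would be learnable tensors with straight-through gradients.
    """
    qmax = (1 << bits) - 1  # largest integer code, e.g. 3 for 2-bit
    out = []
    for w in weights:
        q = round(w / step) + zero_point
        q = max(0, min(qmax, q))       # clamp to the representable range
        out.append(step * (q - zero_point))
    return out
```

For example, with a 2-bit grid (`bits=2`, codes 0..3), values far outside the grid saturate at the clamping bound, which is why training the step size itself, as E2E-QP does, matters at very low bit-widths.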