Large language models (LLMs) are central to modern natural language processing and artificial intelligence, but their substantial memory requirements pose a significant deployment challenge. Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, yet it is often impractical because of the extensive training resources it requires. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT comprises two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enlarging the solution space explored during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by accounting for interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, at scales from 7B to 70B parameters and various quantization bit-widths. For instance, EfficientQAT produces a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points of accuracy degradation relative to full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.
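To make the two phases concrete, the following is a minimal sketch (not the authors' released code; the function name and pure-Python form are illustrative assumptions) of the uniform fake-quantization operation at the heart of QAT. The step size and zero point are the quantization parameters: Block-AP would train them jointly with the weights block by block, while E2E-QP would freeze the weights and continue training only these parameters end-to-end.

```python
def fake_quantize(weights, step, zero_point, bits):
    """Quantize weights to a `bits`-bit uniform grid, then dequantize.

    `step` (the scale) and `zero_point` are the quantization parameters
    that E2E-QP would keep training after Block-AP; in a real QAT setup
    they would be learnable tensors with straight-through gradients.
    """
    qmax = (1 << bits) - 1  # largest integer code, e.g. 3 for 2-bit
    out = []
    for w in weights:
        q = round(w / step) + zero_point
        q = max(0, min(qmax, q))       # clamp to the representable range
        out.append(step * (q - zero_point))
    return out
```

For example, with a 2-bit grid (`bits=2`, codes 0..3), values far outside the grid saturate at the clamping bound, which is why training the step size itself, as E2E-QP does, matters at very low bit-widths.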