Large language models (LLMs) have shown excellent performance on various tasks, but their astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for low-bit, weight-only quantization of LLMs. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations rather than the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on a mobile GPU (NVIDIA Jetson Orin 64GB).
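To make the idea of activation-aware per-channel scaling concrete, the following is a minimal NumPy sketch, not the implementation described in this paper. It assumes a single linear layer, per-row round-to-nearest 4-bit quantization, scales of the form (mean absolute activation)^alpha, and a small grid search over alpha that minimizes output error on calibration activations. The function names `quantize_rtn` and `search_channel_scales`, the exponent `alpha`, the scale normalization, and the synthetic data are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Round-to-nearest asymmetric quantization, one range per output row.

    Returns the dequantized (float) weights so errors can be measured directly.
    """
    q_max = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / q_max
    zero = np.round(-w_min / step)
    q = np.clip(np.round(w / step) + zero, 0, q_max)
    return (q - zero) * step


def search_channel_scales(w, x, n_bits=4, n_grid=20):
    """Grid-search a per-input-channel scale from activation statistics.

    w: (out_features, in_features) weights; x: (samples, in_features) activations.
    Assumed form: s = mean(|x|) ** alpha with alpha in [0, 1].
    No gradients are used; only forward evaluations of the layer output.
    """
    act_mag = np.abs(x).mean(axis=0) + 1e-8   # per-channel activation magnitude
    reference = x @ w.T                       # full-precision layer output
    best = (np.inf, None, None)
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha                  # alpha = 0 gives s = 1 (plain RTN)
        s = s / np.sqrt(s.max() * s.min())    # keep scales centered around 1
        # Scale weights up before quantization, then fold the inverse scale back.
        w_hat = quantize_rtn(w * s, n_bits) / s
        err = np.mean((x @ w_hat.T - reference) ** 2)
        if err < best[0]:
            best = (err, alpha, s)
    return best


# Synthetic calibration data with a few high-magnitude ("salient") channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :2] *= 30.0
w = rng.normal(size=(32, 64))

rtn_err = np.mean((x @ quantize_rtn(w).T - x @ w.T) ** 2)
best_err, best_alpha, best_scales = search_channel_scales(w, x)
print(f"RTN error: {rtn_err:.4f}, scaled error: {best_err:.4f}, alpha: {best_alpha:.2f}")
```

Because alpha = 0 is part of the grid and reduces to plain round-to-nearest, the selected scaling can never do worse than the unscaled baseline on the calibration activations. In this sketch the scale is divided back out of the weights for simplicity; an inference framework would more plausibly fold the inverse scale into the preceding operation.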