Large language models (LLMs) have shown excellent performance on various tasks, but their astronomical size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for low-bit weight-only quantization of LLMs. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations rather than the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across domains and modalities without overfitting to the calibration set; it also does not rely on any data layout reordering, which maintains hardware efficiency. AWQ outperforms existing work on various language modeling, common-sense QA, and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement efficient tensor core kernels with reorder-free online dequantization to accelerate AWQ, achieving a 1.45x speedup over GPTQ and a 1.85x speedup over the cuBLAS FP16 implementation. Our method provides a turn-key solution to compress LLMs to 3/4 bits for efficient deployment.
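The activation-aware scaling idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The per-row round-to-nearest quantizer, the grid search over an exponent `alpha`, the scale normalization, and the function names `quantize_rtn` and `search_scale` are all assumptions made for illustration. The sketch scales each input channel of the weight matrix by a factor derived from the average activation magnitude on a calibration batch, quantizes the scaled weights, and keeps the scale that gives the smallest output error. No backpropagation is involved.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Round-to-nearest min-max quantization per output row.

    Returns the dequantized (floating-point) weights so the error can be measured.
    """
    levels = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / levels
    q = np.clip(np.round((w - w_min) / step), 0, levels)
    return q * step + w_min


def search_scale(w, x, n_bits=4, n_grid=20):
    """Grid-search a per-input-channel scale from activation statistics.

    w: (out_features, in_features) weights
    x: (n_samples, in_features) calibration activations
    Assumption: the scale has the form s = mean(|x|) ** alpha, alpha in [0, 1].
    """
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)  # (in_features,)
    ref = x @ w.T                                       # full-precision output
    best_s, best_alpha, best_err = np.ones_like(act_mag), 0.0, np.inf

    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1
        # Scaling the weights by s and the activations by 1/s leaves the
        # product unchanged before quantization; here 1/s is folded back
        # into the dequantized weights, which is mathematically equivalent.
        w_q = quantize_rtn(w * s, n_bits) / s
        err = float(np.mean((x @ w_q.T - ref) ** 2))
        if err < best_err:
            best_s, best_alpha, best_err = s, float(alpha), err

    return best_s, best_alpha, best_err


# Toy calibration data: a few input channels carry much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))
x[:, :2] *= 25.0
w = rng.normal(size=(32, 64))

scale, alpha, err = search_scale(w, x, n_bits=3)
baseline = float(np.mean((x @ quantize_rtn(w, 3).T - x @ w.T) ** 2))
print(f"alpha={alpha:.2f}  scaled error={err:.4f}  plain RTN error={baseline:.4f}")
```

Because `alpha = 0` (no scaling) is part of the grid, the selected scale can never do worse than plain round-to-nearest on the calibration batch. How much it helps depends on how unevenly activation magnitude is distributed across channels.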