AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and the constraints of edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for low-bit weight-only quantization of LLMs. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLMs/VLMs, offering more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

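The activation-aware scaling idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`quantize_rtn`, `search_scale`), the group size, the grid resolution, and the choice to parameterize the per-channel scale as a power `alpha` of the mean activation magnitude are all assumptions made for illustration. The sketch scales each input channel of a weight matrix before round-to-nearest quantization, undoes the scale afterwards, and picks the `alpha` that minimizes output error on calibration activations, with no backpropagation involved.

```python
import numpy as np


def quantize_rtn(w, n_bits=4, group_size=32):
    """Quantize-dequantize w with asymmetric round-to-nearest, per group.

    w has shape (out_features, in_features); groups run along the input
    dimension. group_size is an illustrative choice, not from the source.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    q_max = 2 ** n_bits - 1
    step = np.maximum(w_max - w_min, 1e-8) / q_max
    q = np.clip(np.round((g - w_min) / step), 0, q_max)
    return (q * step + w_min).reshape(out_f, in_f)


def search_scale(w, x, n_grid=20):
    """Grid-search a per-input-channel scale from activation statistics.

    Assumption: scale = mean(|x|) ** alpha, alpha in [0, 1]. alpha = 0
    gives a scale of all ones, i.e. plain round-to-nearest.
    """
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-4)  # (in_features,)
    ref = x @ w.T  # full-precision layer output
    best_err, best_scale = np.inf, None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1
        # Scale weights up before quantization, then fold the scale back.
        w_q = quantize_rtn(w * s) / s
        err = np.mean((x @ w_q.T - ref) ** 2)
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale, best_err


# Synthetic calibration data: a few input channels carry large activations.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
x = rng.normal(size=(256, 128))
x[:, :2] *= 50.0  # hypothetical "salient" channels

err_rtn = np.mean((x @ quantize_rtn(w).T - x @ w.T) ** 2)
scale, err_awq = search_scale(w, x)
print(f"RTN error: {err_rtn:.4f}, scaled error: {err_awq:.4f}")
```

Because the grid includes `alpha = 0`, the searched result can never be worse than plain round-to-nearest on the calibration data; how much it improves depends on how unevenly activation magnitude is distributed across channels.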