QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

From arXiv. The first three authors contributed equally to this project and are listed in alphabetical order. Yujun Lin leads the quantization algorithm; Haotian Tang and Shang Yang lead the GPU kernels and the serving system. Code is available at https://github.com/mit-han-lab/qserve

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization, which allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, compared to TensorRT-LLM, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and that of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/qserve.
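The abstract names progressive quantization without spelling out its mechanics. The following is a minimal sketch in plain Python, assuming a two-level scheme: floating-point weights are first quantized per output channel to INT8, and the INT8 values are then quantized per group to unsigned 4-bit codes with an integer scale and zero point, so that recovering INT8 from INT4 needs only integer arithmetic. The function names, the clamp value of 119, the group size, and the choice to widen each group's range to include zero are all illustrative assumptions, not QServe's API or kernels.

```python
def quantize_int8_per_channel(row, qmax=119):
    """Level 1 (assumed): symmetric per-channel quantization of floats to INT8.

    qmax is kept below 127 to leave headroom for level-2 rounding error.
    """
    scale = max(abs(w) for w in row) / qmax or 1.0  # 1.0 if the row is all zeros
    q8 = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q8, scale


def quantize_int4_per_group(q8, group_size=4):
    """Level 2 (assumed): asymmetric per-group quantization of INT8 to UINT4.

    Each group stores 4-bit codes plus an integer scale and zero point.
    """
    groups = []
    for start in range(0, len(q8), group_size):
        g = q8[start:start + group_size]
        lo, hi = min(min(g), 0), max(max(g), 0)  # make zero representable
        s = max(1, -(-(hi - lo) // 15))          # ceiling division, integer scale
        z = round(-lo / s)                       # zero point, lands in [0, 15]
        q4 = [max(0, min(15, round(v / s) + z)) for v in g]
        groups.append((q4, s, z))
    return groups


def dequantize_to_int8(groups):
    """Integer-only recovery of INT8 values: (code - zero_point) * scale."""
    return [(q - z) * s for q4, s, z in groups for q in q4]


# Hypothetical weights for one output channel.
row = [0.9, -0.31, 0.02, 0.47, -1.2, 0.05, 0.6, -0.08]
q8, scale = quantize_int8_per_channel(row)
groups = quantize_int4_per_group(q8, group_size=4)
deq8 = dequantize_to_int8(groups)
approx = [v * scale for v in deq8]  # floating-point scale applied once at the end
```

In this sketch the only floating-point multiply is the final per-channel rescale; everything between the 4-bit codes and the INT8 operands stays in integer arithmetic, which is the property the abstract associates with low dequantization overhead. With the level-1 clamp at 119 and a level-2 scale of at most 16, the recovered values stay inside the signed 8-bit range.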


