QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

From arXiv. The first three authors contributed equally to this project and are listed in alphabetical order. Yujun Lin leads the quantization algorithm; Haotian Tang and Shang Yang lead the GPU kernels and the serving system. Code is available at https://github.com/mit-han-lab/qserve

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm we introduce progressive quantization, which allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, compared to TensorRT-LLM, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and that of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x.
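The abstract names progressive quantization but does not spell out the scheme, so the following is only a minimal sketch under stated assumptions, not the QServe implementation. It assumes a two-stage scheme: weights are first quantized symmetrically to 8-bit integers with one floating-point scale, and each group of 8-bit values is then quantized to 4 bits with an integer scale and an integer offset. Under that assumption, going from 4-bit back to 8-bit needs only integer arithmetic. The function names, the group size, and the narrowed stage-1 range of [-119, 119] (chosen here so that reconstructed values never leave [-127, 127]) are all hypothetical choices made for illustration.

```python
GROUP_SIZE = 4  # hypothetical, tiny group size for demonstration only


def quantize_progressive(weights, group_size=GROUP_SIZE):
    """Two-stage quantization sketch: float -> INT8 -> UINT4 per group."""
    # Stage 1: symmetric quantization with one float scale.
    # Assumption: use [-119, 119] instead of [-127, 127] to leave headroom
    # for the rounding error introduced by stage 2.
    max_abs = max(abs(w) for w in weights)
    fp_scale = max_abs / 119.0 if max_abs > 0 else 1.0
    q8 = [round(w / fp_scale) for w in weights]

    # Stage 2: per-group asymmetric quantization of the INT8 values,
    # using an integer scale and an integer offset.
    groups = []
    for start in range(0, len(q8), group_size):
        chunk = q8[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        int_scale = max(1, -(-(hi - lo) // 15))  # ceiling division, at least 1
        q4 = [round((v - lo) / int_scale) for v in chunk]  # each in [0, 15]
        groups.append((q4, int_scale, lo))
    return groups, fp_scale


def dequantize_to_int8(groups):
    """Integer-only step: recover approximate INT8 values from UINT4."""
    return [q * int_scale + lo for q4, int_scale, lo in groups for q in q4]


def dequantize_to_float(groups, fp_scale):
    """Apply the single float scale after the integer-only step."""
    return [v * fp_scale for v in dequantize_to_int8(groups)]


example_weights = [0.9, -1.2, 0.05, 0.4, -0.33, 2.0, -2.0, 0.0]
example_groups, example_scale = quantize_progressive(example_weights)
example_restored = dequantize_to_float(example_groups, example_scale)
```

In this sketch the only floating-point multiply happens once per value in `dequantize_to_float`; the 4-bit to 8-bit step is a multiply-add on small integers, which is the property the abstract associates with low dequantization overhead in W4A8 GEMM.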

