Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which is 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedups. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm we introduce progressive quantization, which enables low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and exploit register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on an A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/qserve.
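The abstract only summarizes progressive quantization; the sketch below illustrates the general two-level idea behind a W4A8 scheme: weights are first quantized to INT8 with per-channel floating-point scales, then further quantized to 4-bit with per-group integer scales, so that the runtime INT4-to-INT8 dequantization step stays in the integer domain before the INT8 GEMM. Function names, the group size, and the zero-point convention are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def progressive_quantize(w, group_size=32):
    """Two-level quantization sketch (illustrative, not the QoQ spec).

    Level 1: symmetric per-channel INT8 quantization with FP scales.
    Level 2: per-group quantization of the INT8 weights to unsigned
    4-bit (0..15, zero point 8) with INT8-domain group scales.
    """
    # Level 1: per-output-channel FP scale, weights -> INT8.
    s1 = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w8 = np.clip(np.round(w / s1), -127, 127).astype(np.int8)

    # Level 2: group the INT8 weights and pick an integer scale so each
    # group fits in [-7, 7] before the zero-point shift to [0, 15].
    w8g = w8.reshape(w8.shape[0], -1, group_size).astype(np.int32)
    s2 = np.ceil(np.abs(w8g).max(axis=2, keepdims=True) / 7.0).astype(np.int32)
    s2 = np.maximum(s2, 1)
    w4 = np.clip(np.round(w8g / s2) + 8, 0, 15).astype(np.uint8)
    return w4, s2, s1

def dequantize_to_int8(w4, s2):
    # Runtime path: subtract the zero point and multiply by the integer
    # group scale -- pure integer arithmetic, no FP work before the GEMM.
    return (w4.astype(np.int32) - 8) * s2
```

The point of the second level using integer scales is that the hot INT4-to-INT8 dequantization avoids floating-point operations on CUDA cores; the remaining per-channel FP scale `s1` is applied once to the GEMM output.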