Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which is 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedups. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building on this insight, the QoQ algorithm introduces progressive quantization, which keeps dequantization overhead low in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and exploit register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU achieves even higher throughput than TensorRT-LLM on an A100. QServe thus effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/qserve.
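To make the progressive-quantization idea concrete, here is a minimal NumPy sketch, not the paper's actual kernel. It assumes symmetric scaling at both levels and a hypothetical group size: weights are first quantized per output channel to INT8 with floating-point scales, then each group of INT8 values is quantized to 4 bits with a small integer scale, so runtime INT4-to-INT8 dequantization is an integer multiply rather than a float multiply on CUDA cores.

```python
import numpy as np

def progressive_quantize(w, group_size=64):
    """Two-level (progressive) weight quantization sketch.

    Level 1: per-output-channel symmetric INT8 quantization, fp scales.
    Level 2: per-group quantization of the INT8 values down to 4 bits
    with integer scales, so INT4 -> INT8 dequantization is integer-only.
    """
    # Level 1: fp weights -> INT8, one fp scale per output channel
    s1 = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 127.0, 1e-8)
    w8 = np.clip(np.round(w / s1), -127, 127).astype(np.int8)

    # Level 2: INT8 -> 4-bit within each contiguous column group
    rows, cols = w8.shape
    n_groups = cols // group_size
    w4 = np.empty_like(w8)
    s2 = np.empty((rows, n_groups), dtype=np.int16)
    for g in range(n_groups):
        blk = w8[:, g * group_size:(g + 1) * group_size].astype(np.int32)
        # integer scale so 4-bit values cover the group's INT8 range
        scale = np.maximum(1, np.ceil(np.abs(blk).max(axis=1) / 7.0)).astype(np.int16)
        s2[:, g] = scale
        w4[:, g * group_size:(g + 1) * group_size] = np.clip(
            np.round(blk / scale[:, None]), -7, 7).astype(np.int8)
    return w4, s2, s1

def dequantize_to_int8(w4, s2, group_size=64):
    """Runtime INT4 -> INT8 dequantization: one integer multiply per weight."""
    s2_full = np.repeat(s2.astype(np.int32), group_size, axis=1)
    return np.clip(w4.astype(np.int32) * s2_full, -127, 127).astype(np.int8)
```

After this step, the reconstructed INT8 weights feed a standard INT8 GEMM, and the per-channel fp scale `s1` is applied only once to the accumulated output, which is the source of the low dequantization overhead the abstract refers to.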