The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To use GPU resources efficiently and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLM serving throughput, we introduce Atom, a low-bit quantization method that achieves substantial throughput improvements with negligible accuracy loss. Atom boosts serving throughput by using low-bit operators and reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to FP16 and by up to $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.
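To make the terms "fine-grained" and "mixed-precision" concrete, the following is a minimal pure-Python sketch, not Atom's implementation. It assumes symmetric integer quantization, treats the largest-magnitude values as outliers that are kept at 8 bits, and quantizes the rest to 4 bits in small groups that each have their own scale. All function names, the group size, and the toy activation values are hypothetical and chosen only for illustration.

```python
def quantize_group(values, bits):
    """Symmetric quantization of one group; returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def fake_quantize(values, bits=4, group_size=None):
    """Quantize then dequantize, using one scale per group of `group_size` values."""
    if not values:
        return []
    group_size = group_size or len(values)      # None means a single per-tensor scale
    out = []
    for start in range(0, len(values), group_size):
        codes, scale = quantize_group(values[start:start + group_size], bits)
        out.extend(c * scale for c in codes)
    return out


def fake_quantize_mixed(values, n_outliers, group_size, low_bits=4, high_bits=8):
    """Keep the n largest-magnitude values at high_bits; group-quantize the rest."""
    by_magnitude = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outlier_idx = sorted(by_magnitude[:n_outliers])
    outlier_set = set(outlier_idx)
    normal_idx = [i for i in range(len(values)) if i not in outlier_set]

    out = [0.0] * len(values)
    normal_rec = fake_quantize([values[i] for i in normal_idx], low_bits, group_size)
    for i, v in zip(normal_idx, normal_rec):
        out[i] = v
    outlier_rec = fake_quantize([values[i] for i in outlier_idx], high_bits)
    for i, v in zip(outlier_idx, outlier_rec):
        out[i] = v
    return out


def mean_abs_error(reference, approx):
    return sum(abs(x - y) for x, y in zip(reference, approx)) / len(reference)


# Toy activation vector with a single large outlier (illustrative values only).
activations = [0.1, -0.2, 0.05, 0.3, 12.0, -0.15, 0.25, -0.1]

err_per_tensor = mean_abs_error(activations, fake_quantize(activations, bits=4))
err_grouped = mean_abs_error(activations, fake_quantize(activations, bits=4, group_size=4))
err_mixed = mean_abs_error(
    activations, fake_quantize_mixed(activations, n_outliers=1, group_size=4)
)

print(f"per-tensor 4-bit : {err_per_tensor:.4f}")
print(f"grouped 4-bit    : {err_grouped:.4f}")
print(f"mixed 4/8-bit    : {err_mixed:.4f}")
```

On this toy input, a single per-tensor scale is dominated by the outlier and rounds every small value to zero. Per-group scales confine that damage to the group containing the outlier, and moving the outlier to 8 bits removes it from the 4-bit groups entirely. The sketch only simulates the numerical effect; it does not model the low-bit GPU operators that provide the throughput gains described above.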