Generative Large Language Models (LLMs) have demonstrated remarkable results across a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks either to use multi-GPU inference pipelines, which are often complex and costly, or to fall back on smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth rather than compute, specifically for single-batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions as low as 3 bits, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization reduces the perplexity gap from the FP16 baseline by up to 2.1x compared to state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.
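To make the two ideas above concrete, the following is a minimal NumPy sketch under stated assumptions; it is not the SqueezeLLM implementation. It assumes that outliers are selected purely by magnitude, that the non-uniform codebook is fitted with a sensitivity-weighted 1-D k-means, and that the sensitivity values are random placeholders standing in for second-order information. All function names and the `outlier_frac` parameter are hypothetical.

```python
import numpy as np


def dense_sparse_split(w, outlier_frac=0.005):
    """Split w into a dense part (outliers zeroed) and a sparse part (outliers only).

    Assumption: outliers are simply the largest-magnitude entries.
    """
    k = max(1, int(round(outlier_frac * w.size)))
    # Magnitude of the k-th largest entry serves as the outlier threshold.
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    mask = np.abs(w) >= thresh
    sparse = np.where(mask, w, 0.0)   # kept in full precision
    dense = np.where(mask, 0.0, w)    # to be quantized
    return dense, sparse, mask


def weighted_kmeans_1d(values, weights, n_levels=8, n_iter=25):
    """Fit n_levels centroids to 1-D values with a weighted Lloyd iteration."""
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(n_levels):
            sel = codes == j
            wsum = weights[sel].sum()
            if wsum > 0:  # leave empty clusters where they are
                centroids[j] = (weights[sel] * values[sel]).sum() / wsum
    codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, codes


def quantize(w, sensitivity, bits=3, outlier_frac=0.005):
    """Non-uniform quantization of the dense part; outliers pass through unchanged."""
    _, sparse, mask = dense_sparse_split(w, outlier_frac)
    keep = ~mask
    centroids, codes = weighted_kmeans_1d(w[keep], sensitivity[keep], 2 ** bits)
    w_hat = sparse.copy()
    w_hat[keep] = centroids[codes]  # each dense weight becomes a codebook entry
    return w_hat, centroids, mask


# Toy example: a small Gaussian weight matrix with a few injected outliers.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64))
w[rng.integers(0, 64, 8), rng.integers(0, 64, 8)] = 0.5
# Placeholder sensitivities; a real system would derive these from second-order information.
sens = rng.uniform(0.1, 1.0, size=w.shape)

w_hat, centroids, mask = quantize(w, sens)
print("outliers kept in full precision:", int(mask.sum()))
print("mean squared error of dense part:", float(((w_hat - w) ** 2).mean()))
```

With 3 bits, each dense weight is stored as an index into an 8-entry codebook, while the small set of outliers stays in full precision in the sparse part. Removing the outliers before fitting the codebook narrows the value range that the 8 levels have to cover.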