LLMs are seeing growing use in applications such as document analysis and summarization, which require large context windows. With these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately at ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve $<0.1$ perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million tokens on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system.
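To make the idea behind Per-Channel Key Quantization more concrete, the following is a minimal sketch, not the KVQuant implementation. It compares the reconstruction error of grouping a Key matrix by token (one range per row) against grouping it by channel (one range per column). Several things here are assumptions made purely for illustration: the synthetic Key matrix with a single consistently large channel, the helper names (`quantize_dequantize`, `per_token`, `per_channel`, `make_keys`), and the use of plain uniform round-to-nearest quantization in place of the non-uniform datatypes described above.

```python
import random


def quantize_dequantize(values, bits):
    """Uniform asymmetric round-to-nearest quantization of one group,
    followed by dequantization back to floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        # A constant group is represented exactly.
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def per_token(matrix, bits):
    # One quantization range per row (token).
    return [quantize_dequantize(row, bits) for row in matrix]


def per_channel(matrix, bits):
    # One quantization range per column (channel):
    # transpose, quantize each column as a group, transpose back.
    cols = [quantize_dequantize(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


def mse(a, b):
    # Mean squared error over all entries of two equally shaped matrices.
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def make_keys(num_tokens=64, num_channels=16, outlier_channel=3, seed=0):
    # Hypothetical data: most channels are small, one channel is
    # consistently large across all tokens (an assumed outlier channel).
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(num_channels)]
            for _ in range(num_tokens)]
    for row in keys:
        row[outlier_channel] = rng.gauss(20.0, 1.0)
    return keys


keys = make_keys()
err_token = mse(keys, per_token(keys, bits=3))
err_channel = mse(keys, per_channel(keys, bits=3))
print(f"per-token MSE:   {err_token:.4f}")
print(f"per-channel MSE: {err_channel:.4f}")
```

Under this assumed distribution, the large channel stretches every per-token range, so the small channels are rounded on a coarse grid. Grouping by channel confines the large values to their own range. Real Key activations may behave differently, so the sketch only illustrates why the choice of quantization dimension can matter.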