The emergence of LLMs has sparked a new wave of breakthroughs in NLP applications, particularly in question answering and text generation. As demand for longer contexts grows, the Key-Value (KV) cache, whose size grows linearly with context length, becomes a significant bottleneck in model deployment. Existing methods compress the KV cache and improve throughput mainly through heuristics, such as ranking cache entries by attention score and then replacing or evicting them. However, these heuristics may wrongly evict essential KV cache entries, which can significantly degrade model performance. In this paper, we propose QAQ, a Quality Adaptive Quantization scheme for the KV cache. We theoretically demonstrate that the key cache and the value cache exhibit distinct sensitivities to quantization, which leads us to formulate separate non-uniform quantization strategies for each. By integrating dedicated outlier handling and an improved attention-aware approach, QAQ achieves up to a 10x compression ratio of the KV cache with negligible impact on model performance. QAQ significantly reduces the practical hurdles of deploying LLMs, opening up new possibilities for longer-context applications. The code is available at github.com/ClubieDong/KVCacheQuantization.
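To make the two ideas in the abstract concrete, the following minimal sketch shows how the KV cache grows linearly with context length, and how a generic uniform quantizer can keep a few large-magnitude outliers in full precision. It is an illustration under stated assumptions, not the QAQ algorithm: the function names `kv_cache_bytes`, `quantize_with_outliers`, and `dequantize`, the model dimensions, and the choice of a uniform (rather than non-uniform) quantizer are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Tuple


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: int = 16) -> int:
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8


def quantize_with_outliers(
    values: List[float], bits: int, num_outliers: int
) -> Tuple[List[int], float, float, Dict[int, float]]:
    """Uniformly quantize `values`, storing the largest-magnitude entries exactly.

    Assumption: a plain uniform quantizer stands in for the paper's scheme.
    """
    # Pick the indices of the `num_outliers` largest-magnitude entries.
    by_magnitude = sorted(range(len(values)),
                          key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in by_magnitude[:num_outliers]}
    inliers = [v for i, v in enumerate(values) if i not in outliers]
    if not inliers:
        raise ValueError("num_outliers must be smaller than len(values)")

    # The quantization range covers only the inliers, so the step stays small.
    lo, hi = min(inliers), max(inliers)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0

    # Outlier positions get a placeholder code of 0; their exact value is
    # restored from the `outliers` dictionary during dequantization.
    codes = [0 if i in outliers else round((v - lo) / scale)
             for i, v in enumerate(values)]
    return codes, scale, lo, outliers


def dequantize(codes: List[int], scale: float, lo: float,
               outliers: Dict[int, float]) -> List[float]:
    """Reconstruct approximate values; outliers are returned exactly."""
    return [outliers.get(i, lo + c * scale) for i, c in enumerate(codes)]


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    for seq_len in (4096, 8192):
        print(seq_len, kv_cache_bytes(32, 32, 128, seq_len))

    sample = [0.1, -0.2, 0.05, 8.0, 0.3]
    codes, scale, lo, outliers = quantize_with_outliers(sample, bits=4,
                                                        num_outliers=1)
    print(dequantize(codes, scale, lo, outliers))
```

Keeping the single large value out of the quantization range is what lets the remaining entries share a much finer step size. Without it, the same 4-bit budget would have to span the full range from -0.2 to 8.0.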