Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to the large context lengths of modern LLMs, the memory footprint of the KV cache is a major bottleneck for model deployment: it directly limits the achievable batch size and thus the model's ability to deliver high throughput. Existing research addresses this challenge with techniques such as discarding low-attention tokens, quantization, and matrix approximation, which typically degrade model accuracy. In this paper, we propose KVCrush, a technology that can be combined with many KV compression methods to improve model accuracy at a much smaller memory footprint. KVCrush provides an alternative representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a smaller footprint while maintaining the accuracy of the model. Based on our results, KVCrush reduces the LongBench KV cache size by 4x with less than a 1% accuracy drop, and achieves state-of-the-art average accuracy with minimal overhead, adding less than 0.5% to total inference latency. KVCrush not only outperforms state-of-the-art importance-based token retention schemes in accuracy, but is also compatible with typical practical LLM deployments that use KV cache paging schemes such as vLLM and mixed-precision quantization.
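To make the mechanism concrete, the sketch below (plain NumPy, single attention head) illustrates why a KV cache turns each decode step into a linear-time operation over the cached tokens, and what a naive importance-based eviction policy looks like. This is not the KVCrush implementation; the function names `decode_step` and `prune_cache`, the cache budget, and the accumulated-attention importance score are all illustrative assumptions.

```python
# Minimal KV-cache sketch (NumPy, single head). NOT the KVCrush code;
# all names and the eviction policy here are illustrative assumptions.
import numpy as np

D = 64  # head dimension (illustrative)

def decode_step(q, k_cache, v_cache):
    """One decode step: attention cost is O(len(cache)), i.e. linear,
    because past keys/values are reused instead of recomputed."""
    scores = k_cache @ q / np.sqrt(D)          # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over cached tokens
    out = weights @ v_cache                    # (D,)
    return out, weights

def prune_cache(k_cache, v_cache, importance, budget):
    """Naive importance-based retention: keep the `budget` tokens with the
    highest accumulated attention mass. KVCrush additionally retains an
    alternate representation of the evicted tokens; that logic is omitted."""
    keep = np.argsort(importance)[-budget:]
    keep.sort()                                # preserve token order
    return k_cache[keep], v_cache[keep], importance[keep]

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((0, D))
v_cache = rng.standard_normal((0, D))
importance = np.zeros(0)

for step in range(128):
    q = rng.standard_normal(D)
    # Append this step's key/value once; later steps reuse them.
    k_cache = np.vstack([k_cache, rng.standard_normal((1, D))])
    v_cache = np.vstack([v_cache, rng.standard_normal((1, D))])
    importance = np.append(importance, 0.0)
    _, w = decode_step(q, k_cache, v_cache)
    importance += w                            # accumulate attention mass
    if len(k_cache) > 64:                      # fixed cache budget of 64 tokens
        k_cache, v_cache, importance = prune_cache(
            k_cache, v_cache, importance, 64)
```

Discarding low-importance tokens outright, as in this sketch, is exactly the kind of pruning that the abstract notes can hurt accuracy; KVCrush's contribution is to keep the pruned tokens represented in the cache at low overhead rather than dropping them entirely.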