Key-Value (KV) caching has become an essential technique for improving the inference speed and throughput of generative Large Language Models~(LLMs). However, the memory footprint of the KV cache is a critical bottleneck in LLM deployment: the cache grows with batch size and sequence length, and often exceeds the size of the model itself. Although recent methods select and evict unimportant KV pairs from the cache to reduce memory consumption, the effects of eviction on the generative process have not been thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise when the information contained in the evicted KV pairs is discarded entirely, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of the information contained in the evicted KV pairs via reduced-precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at a relatively higher precision to safeguard generation quality. Motivated by these observations, we propose \textit{Mixed-precision KV cache}~(MiKV), a reliable cache compression method that simultaneously preserves context details by retaining the evicted KV pairs in low precision and ensures generation quality by keeping the important KV pairs in high precision. Experiments on diverse benchmarks and LLM backbones show that the proposed method offers a state-of-the-art trade-off between compression ratio and performance compared with other baselines.
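The mixed-precision idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the importance scores, the `keep_ratio` parameter, the bit widths, and the per-token min/max quantizer below are all assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows the contrast with eviction: tokens that would otherwise be dropped are retained at a low bit width, while the top-ranked tokens are stored at a higher bit width.

```python
from typing import List, Sequence


def quantize_dequantize(vec: Sequence[float], bits: int) -> List[float]:
    """Uniform min/max round-to-nearest quantization followed by dequantization.

    Assumption: one scale and offset per token vector. The abstract does not
    specify the quantizer, so this is only one plausible choice.
    """
    levels = (1 << bits) - 1          # number of quantization steps
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)              # constant vector: nothing to quantize
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]


def mixed_precision_cache(
    kv: Sequence[Sequence[float]],
    importance: Sequence[float],
    keep_ratio: float = 0.25,
    high_bits: int = 8,
    low_bits: int = 2,
) -> List[List[float]]:
    """Return a simulated cache: important tokens at high_bits, the rest at low_bits.

    `importance` is a hypothetical per-token score (for example, accumulated
    attention weight); how it is computed is outside the scope of this sketch.
    """
    n_keep = max(1, int(len(kv) * keep_ratio))
    ranked = sorted(range(len(kv)), key=lambda i: importance[i], reverse=True)
    important = set(ranked[:n_keep])
    return [
        quantize_dequantize(v, high_bits if i in important else low_bits)
        for i, v in enumerate(kv)
    ]


if __name__ == "__main__":
    # Four cached token vectors (toy values) and made-up importance scores.
    kv = [
        [0.1, -0.5, 0.9, 0.3],
        [1.0, 0.0, -1.0, 0.5],
        [0.2, 0.4, 0.6, 0.8],
        [-0.3, 0.7, 0.05, -0.9],
    ]
    importance = [0.05, 0.6, 0.1, 0.25]
    for original, stored in zip(kv, mixed_precision_cache(kv, importance)):
        err = max(abs(a - b) for a, b in zip(original, stored))
        print(f"max abs error: {err:.4f}")
```

In this toy setup, an eviction-only policy would keep a single token and discard the other three. Here those three survive as coarse approximations whose per-element error is bounded by half a quantization step, which is the kind of partial information retention the abstract argues for.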