Large Language Models (LLMs) have become a research hotspot. To accelerate LLM inference, caching the computed key-value (KV) states in memory has become a standard technique. However, as the inference length grows, the expanding KV cache can cause out-of-memory failures. Many existing methods address this issue through KV cache compression, primarily by preserving key tokens across all layers to reduce information loss, and most of them allocate a uniform retention budget to every layer. However, we observe, from the perspectives of both attention and hidden-state outputs, that the minimum budget needed to retain essential information varies across layers and models. Building on this observation, this paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate a budget size for each layer. Experimental results show that the proposed method reduces the memory usage of the KV cache to only $\sim$20\% of Full KV inference while achieving nearly lossless performance.