The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm of key embeddings and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
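The compression rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `keep_ratio` parameter, and the per-head NumPy layout are assumptions made here for clarity. The only ingredient taken from the abstract is the selection criterion itself, i.e. retaining the cached KV pairs whose key embeddings have the lowest $L_2$ norm.

```python
import numpy as np

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Retain the KV pairs whose key embeddings have the lowest L2 norm.

    keys, values : arrays of shape (seq_len, head_dim) for one attention head.
    keep_ratio   : fraction of cached pairs to keep (hypothetical parameter;
                   0.5 mirrors the 50% reduction reported in the abstract).
    """
    norms = np.linalg.norm(keys, axis=-1)        # L2 norm of each cached key
    n_keep = max(1, int(len(keys) * keep_ratio))
    # Low-norm keys tend to receive high attention scores, so keep those.
    keep = np.sort(np.argsort(norms)[:n_keep])   # restore original token order
    return keys[keep], values[keep]
```

Because the criterion depends only on the keys, not on query-dependent attention scores, selection can happen before attention is computed, which is why the method stays compatible with fused kernels such as FlashAttention.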