As large language models (LLMs) continue to advance, the demand for higher-quality and faster processing of long contexts is growing across applications. The KV cache is widely adopted because it stores previously generated key and value tokens, effectively eliminating redundant computation during inference. However, as its memory overhead becomes a significant concern, efficient KV cache compression has attracted increasing attention. Most existing methods approach compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, making it difficult to preserve the most effective information at the head level. To this end, we propose EMS to overcome these limitations while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate state-of-the-art performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves average scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy in the Needle-in-a-Haystack task with a cache budget of less than 2% of the context length.
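To make the two components concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single attention head, uses hypothetical names (`global_local_score`, `evict_then_merge`) and parameters (`local_window`, `alpha`, `merge_thresh`), and substitutes a simple cosine-similarity weighted-average merge where the paper's actual merge rule and zero-class mechanism may differ.

```python
import numpy as np

def global_local_score(attn, local_window=32, alpha=0.5):
    """Hypothetical Global-Local importance score for one head.

    attn: (num_queries, num_keys) attention weights.
    Combines attention accumulated over all queries (global view) with
    attention from the most recent `local_window` queries (local view),
    mitigating the bias of purely accumulated scores toward early tokens.
    """
    global_score = attn.sum(axis=0)                  # (num_keys,)
    local_score = attn[-local_window:].sum(axis=0)   # (num_keys,)
    return alpha * global_score + (1 - alpha) * local_score

def evict_then_merge(keys, values, scores, budget, merge_thresh=0.9):
    """Hypothetical Evict-then-Merge sketch.

    Keeps the top-`budget` tokens by score; each remaining token is merged
    (weighted average) into its most similar retained token when cosine
    similarity exceeds `merge_thresh`, and simply evicted otherwise.
    """
    order = np.argsort(scores)[::-1]
    keep, rest = order[:budget], order[budget:]
    k_keep, v_keep = keys[keep].copy(), values[keep].copy()
    weights = np.ones(budget)  # how many tokens each retained slot absorbed
    # Unit-normalized retained keys for cosine similarity (not updated after
    # merges, to keep the sketch simple).
    k_norm = k_keep / np.linalg.norm(k_keep, axis=1, keepdims=True)
    for i in rest:
        k_unit = keys[i] / np.linalg.norm(keys[i])
        sims = k_norm @ k_unit
        j = int(np.argmax(sims))
        if sims[j] > merge_thresh:
            # Redundant token: fold it into the nearest retained token.
            w = weights[j]
            k_keep[j] = (w * k_keep[j] + keys[i]) / (w + 1)
            v_keep[j] = (w * v_keep[j] + values[i]) / (w + 1)
            weights[j] += 1
        # else: token is evicted outright.
    return k_keep, v_keep
```

Run per head, this compresses `num_keys` cached tokens down to `budget` slots; the adaptive, head-wise behavior in the paper would additionally vary how much each head evicts versus merges.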