Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often trade performance degradation against computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios at negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies solely on forward passes of the LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
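To make the eviction step concrete, the following is a minimal sketch of score-based KV pair retention as the abstract describes it, assuming a gating module has already produced a per-position importance score for each cached entry. The function name `evict_kv`, the array layout, and the `keep_ratio` parameter are all illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def evict_kv(keys, values, gate_scores, keep_ratio=0.3):
    """Retain only the top `keep_ratio` fraction of KV pairs by gate score.

    keys, values: (seq_len, d) arrays for one attention head.
    gate_scores: (seq_len,) importance scores from a hypothetical
    lightweight gating module, higher = more critical to keep.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(np.ceil(seq_len * keep_ratio)))
    # Take the indices of the highest-scoring positions, then sort them
    # so the retained cache preserves the original token order.
    top = np.sort(np.argsort(gate_scores)[-n_keep:])
    return keys[top], values[top]

# Toy example: 10 cached positions; evicting 70% keeps 3 of them.
rng = np.random.default_rng(0)
k = rng.standard_normal((10, 4))
v = rng.standard_normal((10, 4))
scores = rng.random(10)
k_kept, v_kept = evict_kv(k, v, scores, keep_ratio=0.3)
```

In a real serving stack this selection would run per layer and per head during prefill and decoding, with the gate scores computed by the trained sink-attention modules rather than supplied externally.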