With the advancement of long-context inference capabilities in large language models (LLMs), the KV cache has become a foundational component. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and accounting for task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
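To make the core idea concrete, the window-based selection described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes a per-token importance score (e.g., accumulated attention weights) is already available, and the function name `select_windows`, the scoring rule (mean importance per window), and the `keep_ratio` parameter are all assumptions for illustration. The key property it demonstrates is that tokens are retained in whole contiguous windows rather than individually, preserving local semantic coherence.

```python
import numpy as np

def select_windows(scores, window_size, keep_ratio):
    """Select top-scoring contiguous windows of tokens to retain in the KV cache.

    scores: per-token importance values (assumed given, e.g. attention weights).
    window_size: number of consecutive tokens per semantic window.
    keep_ratio: fraction of windows to retain.
    Returns the sorted indices of retained tokens, grouped in whole windows
    so that the kept context stays locally continuous.
    """
    n = len(scores)
    num_windows = (n + window_size - 1) // window_size

    # Score each window by the mean importance of its tokens (an assumed rule).
    window_scores = [
        float(np.mean(scores[w * window_size:(w + 1) * window_size]))
        for w in range(num_windows)
    ]

    # Keep the top-scoring windows; retain at least one.
    keep = max(1, int(num_windows * keep_ratio))
    top = sorted(
        sorted(range(num_windows), key=lambda w: window_scores[w], reverse=True)[:keep]
    )

    # Expand the selected windows back to token indices.
    kept = []
    for w in top:
        kept.extend(range(w * window_size, min((w + 1) * window_size, n)))
    return kept
```

In a real system, the per-token scores would come from the model's attention statistics, and (as the abstract notes) the selected window indices could be shared across layers within a group to avoid recomputing the selection at every layer.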