Serving Large Language Models (LLMs) efficiently has become a pressing issue due to the high computational cost of their autoregressive generation process. To mitigate this cost, LLMs commonly employ the KV cache technique to speed up generation. While the KV cache improves computational efficiency, its storage requirements are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios because eviction discards information. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward algorithm that identifies suitable sets of KV states to merge. This merging set identification algorithm motivates a second observation: KV cache sparsity, viewed from a similarity perspective, is independent of the dataset and persistent at the model level. We then propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroSCROLLS benchmarks, we compare our method with other KV cache compression techniques, including H2O and CaM, and show that it achieves superior performance across tasks under both 50% and 35% KV cache budgets.
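The two components described above, identifying merging sets from token-level key-state similarity and collapsing each set with Gaussian kernel weights, can be illustrated with a minimal sketch. The cosine-similarity `threshold`, the greedy consecutive grouping rule, and the choice of the last token as the pivotal state are illustrative assumptions, not the paper's exact algorithm:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_merging_sets(keys, threshold=0.9):
    """Greedily group consecutive token key states whose cosine similarity
    to the first state of the current group exceeds `threshold`.
    (Greedy rule and threshold value are assumptions for illustration.)"""
    sets, current = [], [0]
    for i in range(1, len(keys)):
        if cosine(keys[i], keys[current[0]]) >= threshold:
            current.append(i)
        else:
            sets.append(current)
            current = [i]
    sets.append(current)
    return sets

def gaussian_merge(states, sigma=1.0):
    """Merge one set of states into a single vector using Gaussian-kernel
    weights centered on a pivotal state (here the last token, an assumption)."""
    pivot = states[-1]
    # Squared Euclidean distance of each state to the pivot
    d2 = [sum((x - p) ** 2 for x, p in zip(s, pivot)) for s in states]
    w = [math.exp(-d / (2.0 * sigma ** 2)) for d in d2]
    total = sum(w)
    w = [x / total for x in w]  # normalize weights to sum to 1
    # Weighted average over the set
    return [sum(wi * s[j] for wi, s in zip(w, states))
            for j in range(len(pivot))]
```

In this sketch, highly similar consecutive key states are grouped into one merging set, and each set is then replaced by a single weighted average, so the cache shrinks by one entry per extra member of every set.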