Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context ability tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark. Code is available at https://github.com/FYYFU/HeadKV
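The core idea of head-level compression, allocating a global KV cache budget unevenly across attention heads according to their estimated importance, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the proportional allocation rule, and the top-k selection by attention mass are all assumptions introduced here for clarity.

```python
import numpy as np

def allocate_head_budgets(importance, total_budget, min_per_head=4):
    """Split `total_budget` cached token slots across heads in
    proportion to per-head importance scores (hypothetical rule;
    a floor keeps every head minimally functional)."""
    importance = np.asarray(importance, dtype=float)
    weights = importance / importance.sum()
    budgets = np.maximum((weights * total_budget).astype(int), min_per_head)
    return budgets

def compress_head_cache(attn_scores, budget):
    """Keep the `budget` token positions with the highest attention
    mass for one head, preserving original token order."""
    keep = np.argsort(attn_scores)[-budget:]
    return np.sort(keep)

# Example: 4 heads sharing a total budget of 64 cached tokens.
# A more important head (score 0.50) receives a larger share of
# the cache than a less important one (score 0.10).
scores = [0.50, 0.25, 0.15, 0.10]
budgets = allocate_head_budgets(scores, total_budget=64)
print(budgets)  # per-head KV slot counts, summing to roughly 64
```

The key contrast with layer-level methods is that the budget varies per head rather than being uniform within a layer, so heads estimated to matter for retrieval and reasoning keep more context.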