The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-constrained environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to achieve high compression ratios and high reconstruction fidelity simultaneously. We propose VQKV, a novel training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving model fidelity, representing thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8\% compression ratio on LLaMA3.1-8B while retaining 98.6\% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
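The core idea behind vector quantization — replacing groups of floating-point values with small integer indices into a shared codebook — can be sketched as follows. This is a toy illustration with a random codebook, not VQKV's actual method: the paper's codebook construction, vector grouping, and integration with the KV cache are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "KV cache": 1024 vectors of dimension 8 (float32).
kv = rng.standard_normal((1024, 8)).astype(np.float32)

# Hypothetical codebook of 256 centroids. In practice a codebook
# would be fitted to the data (e.g. via k-means); here it is random.
codebook = rng.standard_normal((256, 8)).astype(np.float32)

def vq_encode(x, codebook):
    """Assign each vector to its nearest centroid (one uint8 index each)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices, codebook):
    """Reconstruct vectors by looking up their centroids."""
    return codebook[indices]

indices = vq_encode(kv, codebook)
recon = vq_decode(indices, codebook)

# Each 8-dim float32 vector (32 bytes) is stored as a single byte:
# a 32x reduction, before counting the (shared, amortized) codebook.
print(kv.nbytes, indices.nbytes)
```

With a codebook shared across many vectors, the codebook's own storage is amortized, which is how a few integer indices can stand in for thousands of floats.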