Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks discarding critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the frequency-domain observation that context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach that iteratively compresses the growing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments across prefilling and decoding demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods on LLaMA-2 and LLaMA-3, highlighting its effectiveness for both long-context understanding and generation.
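The low-frequency retention idea can be illustrated with a minimal sketch. This is our own assumption-laden toy, not the paper's implementation: the function name, the rFFT-based transform, and the truncation rule are hypothetical stand-ins (the actual method may use a different transform and an iterative compression schedule).

```python
import numpy as np

def compress_kv_lowfreq(kv, target_len):
    """Toy sketch (not FreqKV itself): shrink a KV cache along the
    sequence axis by keeping only its low-frequency components.
    `kv` has shape (seq_len, head_dim)."""
    seq_len = kv.shape[0]
    # frequency spectrum over the sequence dimension
    spec = np.fft.rfft(kv, axis=0)
    # retain only the low-frequency bins needed for the shorter length
    keep = target_len // 2 + 1
    spec_trunc = spec[:keep]
    # inverse transform to the shorter length; rescale to compensate
    # for numpy's 1/n normalization in irfft
    return np.fft.irfft(spec_trunc, n=target_len, axis=0) * (target_len / seq_len)
```

Because slowly varying (low-frequency) content survives truncation, a smooth cache is reconstructed almost exactly at the shorter length, while rapidly oscillating components are dropped.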