Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation; these behaviors are highly fragile to information loss during decoding, which creates critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. More fundamentally, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others. Experiments reveal that only a small fraction of heads proves essential for reasoning, enabling 20--50% cache reduction with near-lossless performance and up to a 1.21x speedup.
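The allocation strategy described above can be sketched as follows. This is a minimal, hypothetical illustration of head-wise KV cache budgeting: heads flagged as reasoning-critical retain their full cache, while the remaining heads keep only a recent sliding window (one simple form of aggressive compression). All names, and the sliding-window policy itself, are illustrative assumptions, not RLKV's actual implementation.

```python
# Hypothetical sketch: per-head KV cache with mixed retention policies.
# Critical heads keep every entry; other heads keep a bounded recent window.
from collections import deque


class HeadwiseKVCache:
    def __init__(self, num_heads, critical_heads, window=4):
        self.critical = set(critical_heads)
        # Unbounded list per critical head; bounded deque per compressed head
        # (deque with maxlen evicts the oldest entry automatically).
        self.cache = [
            [] if h in self.critical else deque(maxlen=window)
            for h in range(num_heads)
        ]

    def append(self, head, kv):
        self.cache[head].append(kv)

    def size(self, head):
        return len(self.cache[head])


# Example: 4 heads, head 0 deemed critical, window of 4 for the rest.
cache = HeadwiseKVCache(num_heads=4, critical_heads=[0], window=4)
for step in range(10):
    for h in range(4):
        cache.append(h, ("k", "v", step))

assert cache.size(0) == 10  # critical head retains all 10 tokens
assert cache.size(1) == 4   # compressed head keeps only the recent window
```

With a small critical set, the total cache footprint shrinks roughly in proportion to the fraction of compressed heads, which is consistent with the 20--50% reduction reported above.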