Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache compression baselines, which reach only 60% of that performance. Remarkably, R-KV even achieves 105% of the full KV cache performance with 16% of the KV cache. This KV cache reduction also yields a 90% memory saving and a 6.6× throughput improvement over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
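The abstract does not spell out the selection rule itself, so the following is only a minimal sketch of what redundancy-aware KV eviction can look like in general: greedily keep the highest-scoring cached tokens while penalizing candidates whose key vectors are similar to tokens already kept. All names here (`compress_kv_cache`, `redundancy_weight`, the cosine-similarity redundancy proxy, the accumulated-attention importance score) are illustrative assumptions, not the actual R-KV algorithm.

```python
import torch

def compress_kv_cache(keys, values, attn_scores,
                      budget_ratio=0.10, redundancy_weight=0.5):
    """Hypothetical redundancy-aware KV cache compression for one head.

    keys, values: [seq_len, head_dim] cached tensors.
    attn_scores:  [seq_len] importance of each cached token, e.g. the
                  attention mass it has received from later queries.
    Keeps roughly `budget_ratio` of the cache, preferring tokens that are
    important AND not redundant with already-kept tokens.
    """
    seq_len = keys.size(0)
    budget = max(1, int(seq_len * budget_ratio))

    # Pairwise cosine similarity between key vectors as a redundancy proxy.
    normed = torch.nn.functional.normalize(keys, dim=-1)
    sim = normed @ normed.T  # [seq_len, seq_len]

    kept = []
    candidates = attn_scores.clone().float()
    for _ in range(budget):
        idx = int(torch.argmax(candidates))
        kept.append(idx)
        candidates[idx] = float("-inf")  # never pick the same token twice
        # Penalize remaining tokens similar to the one just kept
        # (the redundancy-aware step).
        candidates -= redundancy_weight * sim[idx].clamp(min=0)

    kept = sorted(kept)  # preserve positional order for position metadata
    return keys[kept], values[kept], kept

# Toy usage: 64 cached tokens, 16-dim heads, keep ~10% of the cache.
k, v = torch.randn(64, 16), torch.randn(64, 16)
scores = torch.rand(64)
k_small, v_small, idx = compress_kv_cache(k, v, scores)
```

In a real decoder, a routine like this would run per attention head whenever the cache exceeds its budget, with the kept indices applied consistently to keys, values, and positions.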