Key-value (KV) caching is essential for large language model inference, yet its memory overhead is a critical bottleneck for long-context generation. Existing eviction policies rely predominantly on empirical heuristics and lack a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation by approximating the log-determinant objective with statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism designed to retain as much predictive signal as possible under a fixed cache budget. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generation fidelity.
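To make the idea of leverage-score selection under a log-determinant objective concrete, the following is a minimal illustrative sketch and not the CapKV implementation. It assumes a standard Gaussian-channel form of capacity, $\tfrac{1}{2}\log\det(I + K_S^\top K_S / \sigma^2)$, for a retained key subset $K_S$, and it ranks tokens by ridge leverage scores $\ell_i = k_i^\top (K^\top K + \lambda I)^{-1} k_i$. The function names, the ridge parameter `ridge`, the noise variance `noise_var`, and the top-`budget` selection rule are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import numpy as np


def ridge_leverage_scores(keys, ridge=1.0):
    """Ridge leverage score of each row k_i of `keys` (shape n x d):
    l_i = k_i^T (K^T K + ridge * I)^{-1} k_i.
    (Hypothetical helper; `ridge` is an assumed regularizer.)"""
    d = keys.shape[1]
    gram = keys.T @ keys + ridge * np.eye(d)
    solved = np.linalg.solve(gram, keys.T)        # shape (d, n)
    return np.einsum("nd,dn->n", keys, solved)    # diagonal of K @ solved


def logdet_capacity(keys, noise_var=1.0):
    """Assumed linear-Gaussian capacity: 0.5 * logdet(I + K^T K / noise_var)."""
    d = keys.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(d) + keys.T @ keys / noise_var)
    return 0.5 * logdet


def select_by_leverage(keys, budget, ridge=1.0):
    """Return the sorted indices of the `budget` rows with the largest scores.
    (Assumed selection rule: plain top-k, no reweighting or resampling.)"""
    scores = ridge_leverage_scores(keys, ridge)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((64, 8))           # 64 cached tokens, head dim 8
    keep = select_by_leverage(keys, budget=16)
    print("retained indices:", keep)
    print("capacity (full cache):", logdet_capacity(keys))
    print("capacity (retained):  ", logdet_capacity(keys[keep]))
```

Under these assumptions, each score lies in $[0, 1)$ and the retained capacity can never exceed that of the full cache, since $K_S^\top K_S \preceq K^\top K$. The sketch does not guarantee that top-k leverage selection is optimal for the log-determinant objective; it only shows how the two quantities relate in code.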