Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy. Code is available at https://github.com/DreamSoul-AI/OBCache.
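To make the contrast between attention-weight heuristics and output-aware scoring concrete, the following is a minimal sketch, not OBCache's closed-form scores. It assumes a single attention head with no causal mask, and the function names (`accumulated_attention_scores`, `output_perturbation_scores`) are hypothetical. The first function ranks tokens by attention weight summed over queries, as accumulated-attention heuristics do. The second measures by brute force how much the attention output changes when one token's key-value pair is evicted. This brute-force change is the kind of perturbation the abstract describes; how closely the closed-form scores approximate it is not something this sketch shows.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, keep):
    """Scaled dot-product attention weights over the token indices in `keep`."""
    scale = math.sqrt(len(query))
    logits = [sum(a * b for a, b in zip(query, keys[j])) / scale for j in keep]
    return softmax(logits)


def attention(queries, keys, values, keep):
    """Attention outputs (one vector per query) using only tokens in `keep`."""
    outputs = []
    for q in queries:
        weights = attention_weights(q, keys, keep)
        out = [sum(w * values[j][d] for w, j in zip(weights, keep))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs


def accumulated_attention_scores(queries, keys):
    """Heuristic baseline: sum each token's attention weight over all queries."""
    keep = list(range(len(keys)))
    scores = [0.0] * len(keys)
    for q in queries:
        for j, w in enumerate(attention_weights(q, keys, keep)):
            scores[j] += w
    return scores


def output_perturbation_scores(queries, keys, values):
    """Brute force: squared change in attention outputs when token j is evicted."""
    n = len(keys)
    full = attention(queries, keys, values, list(range(n)))
    scores = []
    for j in range(n):
        keep = [t for t in range(n) if t != j]
        pruned = attention(queries, keys, values, keep)
        scores.append(sum((a - b) ** 2
                          for f_row, p_row in zip(full, pruned)
                          for a, b in zip(f_row, p_row)))
    return scores


# Toy data (purely illustrative): 6 cached tokens, 3 queries, head dimension 4.
random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
values = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
queries = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

heuristic = accumulated_attention_scores(queries, keys)
perturbation = output_perturbation_scores(queries, keys, values)

# Under either criterion, the lowest-scoring token is the first eviction candidate.
print("evict by accumulated attention:", heuristic.index(min(heuristic)))
print("evict by output perturbation:  ", perturbation.index(min(perturbation)))
```

The two rankings need not agree. If every cached token carried the same value vector, for example, evicting any of them would leave the outputs unchanged, yet their accumulated attention weights would still differ. The brute-force loop recomputes attention once per candidate token, so it serves only as an illustration; a closed-form score avoids that recomputation.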