Large Language Models have excelled in various fields but encounter efficiency limitations due to the substantial Key-Value (KV) cache required for long-sequence inference. Recent efforts evict non-critical cache elements at runtime, reducing cache size within a given memory budget while preserving generation quality. Our reexamination of foundational principles reveals that prevailing methods aim to minimize an upper bound on the eviction loss, quantified as the L1 distance between the pre- and post-eviction outputs of the multi-head self-attention mechanism. Moreover, our analysis indicates that the common practice of uniformly assigning budgets across attention heads during cache eviction hinders effective budget utilization, degrading generation quality. In light of these findings, we propose a simple yet effective adaptive budget allocation algorithm. This algorithm not only optimizes the loss upper bound in theory but also reduces the eviction loss in practice by aligning with the intrinsic patterns of self-attention mechanisms. Integrating this algorithm into two state-of-the-art methods, we develop Ada-SnapKV and Ada-Pyramid. Extensive evaluations on 16 datasets and the Needle-in-a-Haystack test confirm that both significantly boost performance across various tasks.
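The core idea of adaptive budget allocation can be illustrated with a minimal sketch: instead of giving every head an equal share of the cache budget, rank all (head, position) attention scores jointly and let each head keep however many entries fall into the global top-`total_budget`. The function name and the use of aggregated softmax scores below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def adaptive_head_budgets(scores, total_budget):
    """Allocate a total KV-cache budget across attention heads.

    scores: array of shape (num_heads, seq_len), e.g. attention
            weights aggregated over recent queries (an assumed
            importance measure for this sketch).
    Returns: per-head budget counts summing to total_budget.
    """
    num_heads, seq_len = scores.shape
    flat = scores.reshape(-1)
    # Indices of the globally top-scoring cache entries across all heads.
    top = np.argpartition(flat, -total_budget)[-total_budget:]
    # Map each selected flat index back to its head.
    head_ids = top // seq_len
    # Each head's budget is the number of its entries that survived
    # the global ranking; dispersed-attention heads tend to get more.
    return np.bincount(head_ids, minlength=num_heads)
```

Under this scheme, a head whose attention mass is concentrated on a few positions receives a small budget, freeing slots for heads whose attention is spread broadly, while the total memory footprint stays fixed at `total_budget`.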