Large Language Models have excelled in various fields but encounter efficiency limitations due to the extensive KV cache required for long-sequence inference. Many efforts attempt to evict non-critical cache elements at runtime, reducing cache size within a given memory budget while preserving generation quality. Our reexamination of their underlying principles discerns that prevailing strategies essentially aim to minimize an upper bound of eviction loss within a specific budget allocation. However, we observe that the current practice of uniformly allocating budgets across different attention heads during eviction tends to degrade the quality of generation post-eviction. In light of these findings, we propose a simple yet effective adaptive allocation algorithm that not only theoretically ensures its loss upper bound does not exceed that of previous uniform allocation methods, but also aligns with the characteristics of the self-attention mechanism, thus practically reducing the upper bound. Further, integrating this algorithm with two of the most advanced methods yields Ada-SnapKV and Ada-Pyramid. Extensive experimental validation across 16 datasets and the Needle-in-a-Haystack test confirms that Ada-SnapKV and Ada-Pyramid achieve further enhancements, establishing new state-of-the-art performance.
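The contrast between uniform and adaptive per-head budget allocation can be sketched as follows. This is a minimal illustration under assumed details, not the paper's actual algorithm: the proportional-to-attention-mass heuristic and the function names `uniform_alloc`/`adaptive_alloc` are our own assumptions for exposition.

```python
import numpy as np

def uniform_alloc(total_budget: int, num_heads: int) -> np.ndarray:
    # Baseline: every attention head receives the same cache budget.
    return np.full(num_heads, total_budget // num_heads)

def adaptive_alloc(total_budget: int, head_scores: np.ndarray) -> np.ndarray:
    # Hypothetical adaptive scheme: distribute the total budget in
    # proportion to each head's aggregate attention mass, so heads
    # with more dispersed attention retain more KV entries.
    weights = head_scores / head_scores.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Hand any remainder left over from flooring to the highest-weight heads.
    remainder = total_budget - budgets.sum()
    order = np.argsort(-weights)
    budgets[order[:remainder]] += 1
    return budgets
```

Both schemes spend exactly the same total budget; the adaptive variant merely redistributes it across heads, which is what lets the loss upper bound only tighten relative to uniform allocation.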