The growing computational and memory demands of the Key-Value (KV) cache significantly limit the capabilities of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and gradient-based Hessian approximations; they therefore lack a theoretical foundation and suffer from suboptimal compression quality and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Building on this analysis, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a principled formulation and derives a closed-form solution that uses only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient. Extensive experiments across diverse models and benchmarks demonstrate that KVSlimmer consistently outperforms state-of-the-art (SOTA) methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory cost and latency by 29% and 28%, respectively.
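The asymmetry claim above hinges on how concentrated the spectral energy of a projection weight matrix is. The following is a minimal illustrative sketch (not the paper's code) of that quantity, using synthetic stand-in matrices: `spectral_energy_concentration` and `top_k` are our own illustrative names, and the "Key-like" low-rank versus "Value-like" full-rank matrices are assumptions chosen to mimic concentrated versus dispersed spectra.

```python
# Illustrative sketch: fraction of spectral energy held by the leading
# singular values of a weight matrix. The abstract's framework associates
# concentrated spectra with Query/Key projections and dispersed spectra
# with Value projections; the matrices below are synthetic stand-ins.
import numpy as np

def spectral_energy_concentration(W: np.ndarray, top_k: int = 8) -> float:
    """Fraction of total squared singular-value energy in the top_k
    singular values of W; values near 1.0 indicate a concentrated spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = s ** 2
    return float(energy[:top_k].sum() / energy.sum())

rng = np.random.default_rng(0)
d = 256

# Synthetic "Key-like" weight: low effective rank -> concentrated spectrum.
low_rank = rng.standard_normal((d, 8)) @ rng.standard_normal((8, d))
# Synthetic "Value-like" weight: full-rank Gaussian -> dispersed spectrum.
full_rank = rng.standard_normal((d, d))

print(spectral_energy_concentration(low_rank))   # close to 1.0
print(spectral_energy_concentration(full_rank))  # much smaller
```

Under these assumptions, the low-rank matrix concentrates essentially all of its energy in a few singular values, while the Gaussian matrix spreads energy across its full spectrum, mirroring the Query/Key versus Value distinction the framework formalizes.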