While Large Language Models (LLMs) can theoretically support extensive context windows, their practical deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet they trade off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed-summary-size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding-window artifacts. Our experiments on the BABILong benchmark reveal that prior compression methods degrade by 15-30% across various long-context tasks, whereas LASER-KV maintains stable performance, achieving superior accuracy by a margin of up to 10% at 128k context length. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.
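To make the budgeting policy concrete, the following is a minimal sketch of block-wise accumulative selection as described above. The function name, the use of per-token importance scores, and the top-k selection rule are illustrative assumptions, not the paper's actual implementation: each incoming block of tokens retains a protected quota of `block_size // n` entries, so the retained cache grows with context length rather than being capped at a fixed summary size.

```python
import numpy as np

def accumulate_kv_budget(keys, values, scores, block_size=128, n=4):
    """Hypothetical sketch of block-wise accumulative KV selection.

    For each block of `block_size` tokens, keep the top block_size // n
    tokens by importance score (n is the "protection divisor"), so the
    retained cache accumulates per block instead of being capped at a
    fixed summary size.
    """
    kept_idx = []
    total = keys.shape[0]
    for start in range(0, total, block_size):
        end = min(start + block_size, total)
        blk_scores = scores[start:end]
        quota = max(1, (end - start) // n)        # protected quota per block
        top = np.argsort(blk_scores)[-quota:]     # highest-scoring tokens
        kept_idx.extend(start + i for i in sorted(top))
    idx = np.array(kept_idx)
    return keys[idx], values[idx], idx

# Toy usage: 512 tokens, 4 blocks of 128, each block keeps 128 // 4 = 32.
rng = np.random.default_rng(0)
K = rng.normal(size=(512, 8))
V = rng.normal(size=(512, 8))
s = rng.random(512)
k2, v2, idx = accumulate_kv_budget(K, V, s, block_size=128, n=4)
print(k2.shape)  # (128, 8)
```

Under this policy the compressed cache size is roughly T/n for T processed tokens, which is what lets the experiments separate compression effects from sliding-window truncation.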