Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

From arXiv. 38 pages, 6 figures, 25 tables (including one longtable). Code and figure-regeneration scripts: https://github.com/gpgabriel25/KVCacheBoundaryProtection

We study KV cache eviction under a shared, globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1 $\leq$ 0.064). At $C{=}256$ (13\% retention), reserving 10\% of the cache at each boundary recovers 69--90\% of the $C{=}2{,}048$ reference-ceiling quality on seven LongBench models; a ten-model panel spans 68--98\%. An attention-mass pilot (Qwen2.5-3B, $N{=}30$) suggests why: the position-0 sink holds ${\sim}75\%$ of prefix mass, while other boundary tokens sit near ${\sim}0.41{\times}$ the uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K{=}32$ ($\Delta{=}0.02$); at $K{=}8$, the attention policies converge pairwise yet beat LRU by 0.011--0.021 F1 across $C{=}256$ and $C{=}512$. Faithful Ada-KV/QUEST add ${\sim}0.03$--$0.04$ F1 on Mistral-7B and Phi-3.5 beyond the simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs.\ prefill, $C{\in}\{512,2048\}$) shows near-identical protection lifts (ratio 0.99--1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches the full-cache ceiling on Gemma-3-4B at 6.3\% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
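To make the protection idea concrete, here is a minimal Python sketch of LRU eviction under a global cap with a reserved slice of the cache at each prompt boundary. It is an illustration under stated assumptions, not the authors' harness: the names `protected_positions` and `evict_to_cap` are hypothetical, "each boundary" is read here as the first and last tokens of the prompt, and the per-boundary slot count is taken as the floor of 10\% of the cap. The abstract itself does not specify these details.

```python
from typing import Dict, List, Set


def protected_positions(prompt_len: int, capacity: int, frac: float = 0.10) -> Set[int]:
    """Positions reserved at the two prompt boundaries.

    Assumption (not from the paper): the boundaries are the prompt start and
    the prompt end, and each gets floor(frac * capacity) slots.
    """
    k = min(int(frac * capacity), prompt_len // 2)  # slots per boundary
    head = range(0, k)
    tail = range(prompt_len - k, prompt_len)
    return set(head) | set(tail)


def evict_to_cap(last_used: Dict[int, int], protected: Set[int], capacity: int) -> List[int]:
    """Return the cached positions left after LRU eviction down to `capacity`.

    `last_used` maps token position -> step at which it was last touched.
    Protected positions are never eviction candidates.
    """
    kept = dict(last_used)
    # Evictable candidates, least recently used first (ties broken by position).
    candidates = sorted(
        (p for p in kept if p not in protected),
        key=lambda p: (kept[p], p),
    )
    overflow = max(len(kept) - capacity, 0)
    for p in candidates[:overflow]:
        del kept[p]
    return sorted(kept)


if __name__ == "__main__":
    # Toy setting: 100 prompt tokens, cap of 20, older positions touched earlier.
    usage = {p: p for p in range(100)}
    guard = protected_positions(prompt_len=100, capacity=20)
    print("protected:", sorted(guard))
    print("with protection:", evict_to_cap(usage, guard, 20))
    print("without protection:", evict_to_cap(usage, set(), 20))
```

In this toy run, plain LRU keeps only the 20 most recent positions and drops position 0, whereas the protected variant keeps both boundary slices and fills the remaining slots by recency. Swapping the sort key for an attention-based score would give the protected analogue of a scoring policy; the protected set itself is independent of the scorer.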
