We study KV cache eviction under a shared, globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, all of them collapse to near-zero quality on six pure-transformer models (F1$\leq$0.064). Reserving 10\% of the cache at each boundary recovers 69--90\% of the $C{=}2{,}048$ reference-ceiling quality at $C{=}256$ (13\% retention) on seven LongBench models; across a ten-model panel the range is 68--98\%. An attention-mass pilot (Qwen2.5-3B, $N{=}30$) suggests why: the position-0 sink holds ${\sim}75\%$ of prefix mass, while other boundary tokens receive only ${\sim}0.41{\times}$ the uniform expectation, so attention-based scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K{=}32$ ($\Delta{=}0.02$); at $K{=}8$, the attention policies converge pairwise yet beat LRU by 0.011--0.021 F1 at both $C{=}256$ and $C{=}512$. Faithful implementations of Ada-KV and QUEST add ${\sim}0.03$--$0.04$ F1 on Mistral-7B and Phi-3.5 over the simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs.\ prefill, $C{\in}\{512,2048\}$) shows near-identical protection lifts (ratio 0.99--1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches the full-cache ceiling on Gemma-3-4B at 6.3\% retention, but only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
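To make the boundary-protection idea concrete, the following is a minimal sketch, not the harness used in the study. It assumes that the two protected boundaries are the start and the end of the prompt, that each receives 10\% of the cap $C$, and that the remaining budget is filled by recency as an LRU-like fallback. The function name `protected_lru_keep` and its parameters are hypothetical and chosen only for illustration.

```python
def protected_lru_keep(seq_len, prompt_len, capacity, protect_frac=0.10):
    """Return the sorted cache positions kept under a global cap.

    Assumptions (illustrative, not taken from the paper's code):
      - boundaries are the first and last tokens of the prompt;
      - each boundary is granted protect_frac of the capacity;
      - unprotected slots are filled with the most recent positions.
    """
    if seq_len <= capacity:
        return list(range(seq_len))  # nothing needs to be evicted

    reserve = max(1, int(capacity * protect_frac))

    # Boundary 1: start of the prompt, which includes the position-0 sink.
    head = set(range(min(reserve, prompt_len)))
    # Boundary 2: end of the prompt, just before decoding starts.
    tail = set(range(max(0, prompt_len - reserve), prompt_len))
    protected = head | tail

    # Remaining budget goes to the most recent unprotected positions.
    budget = capacity - len(protected)
    recent = []
    for pos in range(seq_len - 1, -1, -1):
        if len(recent) == budget:
            break
        if pos not in protected:
            recent.append(pos)

    return sorted(protected | set(recent))


if __name__ == "__main__":
    kept = protected_lru_keep(seq_len=2048, prompt_len=1900, capacity=256)
    print(len(kept), kept[:3], kept[-3:])
```

In this sketch, an attention-based scorer would replace only the recency loop; the protected set is fixed before any scoring happens, which is the sense in which protection is structural rather than score-driven.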