Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$--$3$B parameters, we find that the sensitivity of the next-token distribution to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and a position-preserving ablation confirms that the decay is not a positional-encoding artifact. Under a corresponding \emph{polynomial truncation-sensitivity} assumption with decay exponent $\alpha$, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and, under an additional two-sided Bayes-risk condition, a converse shows that $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. In experiments, the polynomial law also predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) reduces distortion by roughly two orders of magnitude relative to random retention at equal budget, and the distortion decays as a power law in the budget.
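
To make the window scaling concrete, the following is a minimal worked calculation rather than the formal statement. It assumes the truncation-sensitivity condition takes the form $\delta(w) \le C\,w^{-\alpha}$, where $\delta(w)$ (the distortion incurred by keeping only the last $w$ tokens) and the constant $C > 0$ are notation introduced here for illustration only.
\[
  C\,w^{-\alpha} \le \varepsilon
  \quad\Longleftrightarrow\quad
  w \ge \left(\frac{C}{\varepsilon}\right)^{1/\alpha},
  \qquad\text{so}\qquad
  w = \left\lceil \left(\frac{C}{\varepsilon}\right)^{1/\alpha} \right\rceil = O\!\left(\varepsilon^{-1/\alpha}\right).
\]
For comparison, under a geometric bound $\delta(w) \le C\rho^{w}$ with $0 < \rho < 1$, the same calculation gives
\[
  w \ge \frac{\log(C/\varepsilon)}{\log(1/\rho)},
  \qquad\text{i.e.}\qquad
  w = O\!\left(\log(1/\varepsilon)\right),
\]
which is why the polynomial-versus-geometric distinction changes the memory requirement qualitatively. As an illustrative instance with hypothetical values $C = 1$ and $\alpha = 1/2$, the sufficient window is $w = \varepsilon^{-2}$ (up to rounding), so halving $\varepsilon$ roughly quadruples the window.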
