The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this question through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (a condition satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer with $n \geq 4k$ and $s \leq \sqrt{n}/4$ requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and we isolate the sole remaining gap as a probabilistic step on the joint distribution of the cache trace and the pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-type lower bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Proving the conjecture amounts to upgrading the max to a product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting (including reachability, bandwidth, and their combinations) cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this barrier requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs. oblivious error scaling. Under random caching over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The resulting $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
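To make the task concrete, the following is a minimal sketch of $k$-hop pointer chasing and of plain pointer doubling, the textbook technique that the windowed construction in the abstract builds on. It is not the paper's windowed algorithm: it assumes unrestricted memory (no cache of size $s$, no precision limit), and the names `chase`, `chase_by_doubling`, and the example pointer map are hypothetical choices made for illustration.

```python
def chase(ptr, start, k):
    """Follow the pointer map for k hops, one hop per step (k sequential steps)."""
    node = start
    for _ in range(k):
        node = ptr[node]
    return node


def chase_by_doubling(ptr, start, k):
    """Reach the same k-hop target with about log2(k) doubling stages.

    After t squarings, `jump[i]` is the node reached from i in 2**t hops.
    The binary digits of k select which of these power-of-two jumps to apply.
    Assumption: the whole jump table fits in memory, i.e. no cache limit.
    """
    jump = list(ptr)
    node = start
    stages = 0
    while k > 0:
        if k & 1:
            node = jump[node]               # apply the current 2**t-hop jump
        k >>= 1
        if k > 0:
            jump = [jump[j] for j in jump]  # square the map: 2**t -> 2**(t+1) hops
            stages += 1
    return node, stages


# Example pointer map on n = 16 tokens (an arbitrary illustrative choice).
n = 16
ptr = [(5 * i + 3) % n for i in range(n)]
target, stages = chase_by_doubling(ptr, start=0, k=13)
assert target == chase(ptr, start=0, k=13)
print(target, stages)
```

The sequential version needs $k$ dependent steps, while doubling needs $\lfloor \log_2 k \rfloor$ squarings, which is at most the $T = \lceil \log_2 k \rceil$ stages quoted above. Each squaring reads every entry of the jump table, which is why a cache restricted to $s$ entries cannot run this scheme unchanged.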