This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) so as to maximize prediction accuracy while minimizing decoding latency. Prior decoders recompute query-key-value (QKV) states for all tokens at every denoising step and layer, even though KV states change little across most steps, especially in shallow layers, which introduces substantial redundant computation. We make three observations: (1) distant $\textbf{MASK}$ tokens primarily act as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on the cache change of every other token. Building on these observations, we propose $\textbf{Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides $\textit{when}$ to refresh (via an attention-aware drift test on the most-attended token) and $\textit{where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code-generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while maintaining accuracy above the baseline throughout. Our method also achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
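To make the joint when/where decision concrete, below is a minimal sketch of one Elastic-Cache denoising loop. It is illustrative only, written under assumed interfaces: `model.forward_with_cache`, `model.compute_kv`, `model.denoise_step`, `kv_cache.get`/`kv_cache.refresh`, and the `drift_threshold`/`refresh_layer` values are hypothetical placeholders, not the paper's actual implementation.

```python
import torch

# Illustrative sketch of the Elastic-Cache decision loop. All interfaces
# (model, kv_cache, forward_with_cache, compute_kv, refresh, denoise_step)
# are hypothetical placeholders, not the authors' actual API.

def elastic_cache_decode(model, tokens, kv_cache, num_steps,
                         drift_threshold=0.1, refresh_layer=8):
    """Adaptive, layer-aware KV-cache refresh for a diffusion LM (sketch)."""
    for _ in range(num_steps):
        # Shallow layers (< refresh_layer) and off-window MASK positions
        # always reuse cached KV states (observations (1) and (2)).
        logits, attn = model.forward_with_cache(
            tokens, kv_cache, reuse_below=refresh_layer)

        # "When" test: probe KV drift on the most-attended token only.
        # Observation (3): this token drifts least, so its drift is a
        # conservative lower bound on the drift of every other cached token.
        # attn is assumed to have shape (heads, query_len, key_len).
        star = attn.mean(dim=(0, 1)).argmax().item()
        new_kv = model.compute_kv(tokens, position=star, layer=refresh_layer)
        old_kv = kv_cache.get(position=star, layer=refresh_layer)
        drift = torch.norm(new_kv - old_kv) / torch.norm(old_kv)

        if drift > drift_threshold:
            # "Where": recompute from refresh_layer onward only, since KV
            # dynamics grow with depth; shallow caches stay intact.
            kv_cache.refresh(tokens, from_layer=refresh_layer)

        # Standard diffusion-LM step: unmask some tokens from the logits.
        tokens = model.denoise_step(tokens, logits)
    return tokens
```

Probing drift at a single, most-attended position keeps the "when" test nearly free: per observation (3) it conservatively underestimates drift elsewhere, so the cache is refreshed before stale entries can degrade other tokens' predictions.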