This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior decoders recompute QKV for all tokens at every denoising step and every layer, even though KV states change little across most steps, especially in shallow layers, which introduces substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on the cache change of other tokens. Building on these observations, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for DLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens) and $45.1\times$ on longer sequences, while maintaining accuracy above the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of DLMs.
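To make the two decisions concrete, below is a minimal sketch (not the authors' implementation) of the attention-aware drift test and the depth-aware refresh schedule, written in PyTorch. The helper names (`most_attended_index`, `kv_drift`, `should_refresh`, `layers_to_refresh`), the drift metric, and the threshold `tau` are illustrative assumptions, not values or APIs taken from the paper.

```python
# Sketch of the two Elastic-Cache decisions under assumed interfaces:
# access to the previous step's attention weights and a per-layer KV cache.
import torch


def most_attended_index(attn_weights: torch.Tensor) -> int:
    """attn_weights: [heads, query_len, key_len] from the active window.
    Returns the key position receiving the most attention, averaged over
    heads and queries."""
    return int(attn_weights.mean(dim=(0, 1)).argmax())


def kv_drift(k_new, v_new, k_cached, v_cached) -> float:
    """Relative change of one token's cached KV pair; small values mean the
    cached entry is still a good approximation."""
    num = (k_new - k_cached).norm() + (v_new - v_cached).norm()
    den = k_cached.norm() + v_cached.norm() + 1e-8
    return float(num / den)


def should_refresh(drift_of_most_attended: float, tau: float = 0.02) -> bool:
    """Attention-aware test: the most-attended token drifts least, so if even
    its drift exceeds tau, other tokens' caches are assumed stale too."""
    return drift_of_most_attended > tau


def layers_to_refresh(num_layers: int, start_layer: int):
    """Depth-aware schedule: reuse shallow-layer caches [0, start_layer) and
    recompute KV only from start_layer onward."""
    return range(start_layer, num_layers)


if __name__ == "__main__":
    # Toy shapes, purely for illustration.
    heads, q_len, k_len, d = 4, 8, 32, 16
    attn = torch.softmax(torch.randn(heads, q_len, k_len), dim=-1)
    star = most_attended_index(attn)

    k_cached, v_cached = torch.randn(d), torch.randn(d)
    k_new = k_cached + 0.01 * torch.randn(d)  # small drift at the probe token
    v_new = v_cached + 0.01 * torch.randn(d)

    if should_refresh(kv_drift(k_new, v_new, k_cached, v_cached)):
        for layer in layers_to_refresh(num_layers=32, start_layer=16):
            pass  # recompute and overwrite this layer's KV cache
```

Since the drift test probes only the single most-attended token per step, the check itself adds negligible cost relative to recomputing QKV for all tokens and layers, which is what makes the adaptive refresh worthwhile.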