Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs): they enable parallel decoding, but at the expense of substantial computation and memory costs. In particular, the cache mechanism for bidirectional attention in dLLMs demands a large memory footprint, restricting their ability to handle long contexts in resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs that focuses on the role of mask tokens. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at a 32k prompt length. The code is publicly available at: https://github.com/jianuo-huang/MaskKV
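To make the two ideas concrete, below is a minimal sketch of mask-query guided scoring followed by per-head top-k eviction. It is an illustrative assumption of how such a mechanism could be wired up, not the paper's actual implementation: the function names (`mask_query_scores`, `evict_kv`), tensor shapes, and the toy budget are all hypothetical, and the adaptive layer/head budgeting described in the abstract is reduced here to a single fixed `budget` argument.

```python
import torch

def mask_query_scores(attn_weights, mask_positions):
    """Score prompt KV pairs by the attention that mask-token queries pay to them.

    attn_weights: (num_heads, q_len, kv_len) attention from decode queries to cached prompt tokens.
    mask_positions: indices of the mask-token queries within q_len.
    Returns a (num_heads, kv_len) importance score per head.
    """
    # Average attention mass received from mask-token queries only.
    return attn_weights[:, mask_positions, :].mean(dim=1)

def evict_kv(keys, values, scores, budget):
    """Keep only the top-`budget` prompt KV pairs per head, preserving token order.

    keys/values: (num_heads, kv_len, head_dim); scores: (num_heads, kv_len).
    """
    num_heads, kv_len, head_dim = keys.shape
    budget = min(budget, kv_len)
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    idx = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)

# Toy usage: 4 heads, 8 mask-token queries attending over a 512-token prompt.
heads, q_len, kv_len, head_dim = 4, 8, 512, 64
attn = torch.softmax(torch.randn(heads, q_len, kv_len), dim=-1)
keys = torch.randn(heads, kv_len, head_dim)
values = torch.randn(heads, kv_len, head_dim)

scores = mask_query_scores(attn, mask_positions=torch.arange(q_len))
k_small, v_small = evict_kv(keys, values, scores, budget=64)
print(k_small.shape)  # torch.Size([4, 64, 64])
```

In a full system, `budget` would vary per layer and per head (smaller in intermediate layers, larger for prompt-preferring heads), which is the role of the adaptive cache budgeting strategy in the abstract.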