Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet they impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity: pivotal tokens remain salient across decoding steps while low-relevance tokens stay unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework to integrate dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. Leveraging the stability of token saliency across steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness. The code is available at https://github.com/OpenMOSS/Sparse-dLLM.
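To make the attention-guided eviction idea concrete, below is a minimal sketch (not the authors' implementation; see the linked repository for the real code) of pruning a bidirectional KV cache by per-token saliency. It assumes attention scores have already been aggregated over heads and layers; all names and the `keep_ratio` parameter are hypothetical.

```python
# Minimal sketch of attention-guided cache eviction, assuming per-token
# saliency scores are available. Hypothetical illustration only.
import torch

def evict_cache(key_cache: torch.Tensor,    # [seq_len, num_heads, head_dim]
                value_cache: torch.Tensor,  # [seq_len, num_heads, head_dim]
                attn_scores: torch.Tensor,  # [seq_len] saliency per cached token
                keep_ratio: float = 0.5):
    """Keep the most-attended cached tokens and drop the rest.

    Because token saliency is observed to be stable across decoding steps,
    scores computed at one step can be reused to prune the cache for later steps.
    """
    seq_len = attn_scores.shape[0]
    num_keep = max(1, int(seq_len * keep_ratio))
    keep_idx = torch.topk(attn_scores, num_keep).indices.sort().values  # preserve token order
    return key_cache[keep_idx], value_cache[keep_idx], keep_idx

# Example: prune a toy cache of 8 tokens down to the 4 most salient ones.
k, v = torch.randn(8, 4, 16), torch.randn(8, 4, 16)
scores = torch.rand(8)
k_small, v_small, kept = evict_cache(k, v, scores, keep_ratio=0.5)
```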