Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy by modeling information-sparse suffix regions uniformly, and from temporal inefficiency by applying fixed denoising schedules throughout the decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation-guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence-aware strategy with an early-exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to a 68.2× speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming-dLLM.
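The confidence-aware early-exit idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the threshold `tau`, and the per-step `confidence_fn` callback are all hypothetical stand-ins for whatever scoring the model produces during block-wise denoising.

```python
import numpy as np

def denoise_with_early_exit(confidence_fn, seq_len, max_steps=64, tau=0.9):
    """Hypothetical sketch of confidence-aware early exit.

    At each denoising step, tokens whose confidence exceeds tau are
    treated as converged and frozen; the loop terminates as soon as
    every position has converged, skipping the remaining iterations
    of a fixed schedule.
    """
    converged = np.zeros(seq_len, dtype=bool)
    for step in range(max_steps):
        conf = confidence_fn(step)      # per-token confidence in [0, 1]
        converged |= conf >= tau        # freeze tokens that crossed tau
        if converged.all():
            return step + 1             # early exit: steps actually used
    return max_steps                    # fell back to the full schedule

# Toy usage: confidence rises uniformly each step, so decoding stops
# well before the 64-step budget is exhausted.
steps_used = denoise_with_early_exit(
    lambda s: np.full(8, 0.2 * (s + 1)), seq_len=8)
```

A real dLLM would compute `conf` from the denoiser's predictive distribution (e.g., max token probability per position) and would only freeze tokens within the active block; the sketch abstracts both away to isolate the control flow that yields the speedup.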