Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
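The saliency criterion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`select_salient_tokens`, `partial_denoise_step`), the threshold value, and the partial-forward callback are all assumptions introduced for clarity. Tokens whose attention-context vectors barely change between adjacent denoising steps are treated as stable and served from cache; the rest are recomputed.

```python
import numpy as np

def select_salient_tokens(prev_ctx, curr_ctx, threshold=0.95):
    """Mark tokens whose attention context drifted between adjacent
    denoising steps (cosine similarity below `threshold`) as salient.

    prev_ctx, curr_ctx: [seq_len, hidden_dim] attention-context vectors
    from steps t-1 and t. Returns a boolean saliency mask of shape [seq_len].
    (Illustrative threshold; the paper's actual criterion may differ.)
    """
    dot = np.sum(prev_ctx * curr_ctx, axis=-1)
    norms = np.linalg.norm(prev_ctx, axis=-1) * np.linalg.norm(curr_ctx, axis=-1)
    cos_sim = dot / np.maximum(norms, 1e-8)  # guard against zero vectors
    return cos_sim < threshold

def partial_denoise_step(partial_forward, cached_acts, salient_mask):
    """Recompute attention/FFN activations only at salient positions,
    reusing cached activations everywhere else.

    `partial_forward` is a hypothetical callback that runs the model
    on just the given token indices and returns their new activations.
    """
    new_acts = cached_acts.copy()
    idx = np.nonzero(salient_mask)[0]
    if idx.size > 0:
        new_acts[idx] = partial_forward(idx)
    return new_acts
```

In this sketch the saliency test costs only a per-token dot product, so the savings scale with the fraction of tokens whose representations stay stable across steps, which the abstract argues is the large majority.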