Masked diffusion language models enable parallel token decoding, offering a promising alternative to sequential autoregressive generation. However, their iterative denoising process remains computationally expensive because it reprocesses the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of the representative open-source diffusion LLMs LLaDA and Dream.
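The saliency test described in the abstract can be illustrated with a minimal NumPy sketch. It is not DyLLM's implementation: the function names `select_salient` and `sparse_update`, the threshold value of 0.99, and the generic token-wise `layer` callable are all assumptions made for illustration, and the abstract does not specify how the similarity is thresholded or how attention itself is recomputed sparsely.

```python
import numpy as np


def select_salient(prev_ctx, curr_ctx, threshold=0.99):
    """Flag tokens whose attention context changed between adjacent steps.

    prev_ctx, curr_ctx: arrays of shape (num_tokens, hidden_dim).
    threshold: hypothetical cutoff; a token is salient when its cosine
    similarity to the previous step falls below this value.
    """
    eps = 1e-8  # guards against division by zero for all-zero rows
    dot = np.sum(prev_ctx * curr_ctx, axis=-1)
    norms = np.linalg.norm(prev_ctx, axis=-1) * np.linalg.norm(curr_ctx, axis=-1)
    cos = dot / np.maximum(norms, eps)
    return cos < threshold


def sparse_update(curr_ctx, cached_out, salient, layer):
    """Recompute a token-wise layer for salient tokens only.

    Non-salient rows are copied from the cached output of the previous step.
    `layer` is a placeholder for any per-token operation such as a
    feed-forward block.
    """
    out = cached_out.copy()
    if salient.any():
        out[salient] = layer(curr_ctx[salient])
    return out
```

Under these assumptions, a denoising step would call `select_salient` on the attention contexts of the previous and current steps, then pass the resulting mask to `sparse_update`, so the cost of the layer scales with the number of salient tokens rather than the full sequence length.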