Unlike autoregressive models, which generate one token at a time, diffusion large language models (dLLMs) jointly denoise a chunk of [MASK] tokens and sample one or more tokens per step. Although this enables parallel decoding, the process incurs substantial computational cost because the chunk of masked tokens is large. We observe that much of this cost is spent repeatedly processing the preceding context and many [MASK] tokens that share the same feature representations, which indicates considerable computational redundancy. In this work, we revisit redundancy in dLLMs from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates decoding and extends naturally toward context-folding-like long-context scaling under limited input-length constraints for full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5. Moreover, for block dLLMs such as LLaDA2.0-mini, it augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.
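To make the idea of position-preserving compression concrete, the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that compression means dropping most [MASK] tokens from the model input while each surviving token keeps its original position index, and that the "protected terminal [MASK]" is the last mask in the sequence. The function name `compress_masks` and the parameter `keep_window` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

MASK = "[MASK]"  # placeholder symbol; a real model would use a tokenizer-specific id


def compress_masks(
    tokens: List[str], keep_window: int, keep_terminal: bool = True
) -> Tuple[List[str], List[int]]:
    """Drop redundant [MASK] tokens while keeping original position ids.

    Hypothetical illustration: every non-mask token is kept, the first
    `keep_window` masks are kept, and optionally the last mask is kept
    as a terminal marker of the full sequence length.
    """
    mask_positions = [i for i, t in enumerate(tokens) if t == MASK]
    kept_masks = set(mask_positions[:keep_window])
    if keep_terminal and mask_positions:
        kept_masks.add(mask_positions[-1])

    kept_tokens: List[str] = []
    position_ids: List[int] = []
    for i, tok in enumerate(tokens):
        if tok != MASK or i in kept_masks:
            kept_tokens.append(tok)
            position_ids.append(i)  # original index, not the compacted one
    return kept_tokens, position_ids


if __name__ == "__main__":
    sequence = ["The", "cat"] + [MASK] * 6
    kept, positions = compress_masks(sequence, keep_window=2)
    print(kept)       # 5 tokens instead of 8
    print(positions)  # original indices of the surviving tokens
```

In this toy example the input shrinks from 8 tokens to 5, yet the surviving position indices still span the full length, so the structural information carried by the masks (how long the generation region is and where it ends) is not lost. How the actual method selects which masks to keep is not specified in this abstract.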