Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are effectively fixed -- wasting substantial compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked positions, and $d$ is the model dimension. In practice, $M$ shrinks as sampling progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis justifying SureLock's design: monitoring only the local KL divergence at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock .
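The two mechanisms in the abstract -- locking a position once its posterior stabilizes, and attending with cached keys/values so only unlocked positions are recomputed -- can be sketched as follows. This is a minimal single-head numpy illustration, not the paper's implementation; the threshold `tau`, the helper names, and the dictionary-based K/V cache are all assumptions made for exposition.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def surelock_step(posteriors_prev, posteriors_curr, locked, tau=1e-3):
    """Sure condition (sketch): lock a still-unlocked position once the local
    KL between its posteriors at consecutive steps falls below tau."""
    for i in range(len(locked)):
        if not locked[i] and posteriors_prev[i] is not None:
            if kl(posteriors_curr[i], posteriors_prev[i]) < tau:
                locked[i] = True
    return locked

def attention_with_locking(X, Wq, Wk, Wv, locked, kv_cache):
    """Single-head attention that skips locked positions' query projection.
    Locked rows reuse their cached K/V, so other positions can still attend
    to them; only the M unlocked rows form queries, giving O(M*N*d) instead
    of O(N^2*d) for the score computation."""
    N, d = X.shape
    unlocked = [i for i in range(N) if not locked[i]]
    K = kv_cache["K"].copy()
    V = kv_cache["V"].copy()
    K[unlocked] = X[unlocked] @ Wk   # refresh K/V only where not locked
    V[unlocked] = X[unlocked] @ Wv
    Q = X[unlocked] @ Wq             # queries only for unlocked positions
    scores = Q @ K.T / np.sqrt(d)    # (M, N): every key is still visible
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    out = np.zeros_like(X)
    out[unlocked] = A @ V            # locked rows' outputs are simply not recomputed
    kv_cache["K"], kv_cache["V"] = K, V
    return out
```

Because locked positions' keys and values stay in the cache, their contribution to other tokens' attention is unchanged; the savings come purely from dropping their query projections, score rows, and (in the full model) feed-forward sublayers.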