Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when that context could substantially reduce their uncertainty. This limitation degrades decoding certainty and generation quality, especially on tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a certainty-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.73% on average over fixed block-based diffusion methods at comparable decoding time, with the largest single improvement reaching 16.5%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
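The core mechanism described above (a certainty-aware sliding window that commits confident tokens early and defers uncertain ones) can be illustrated with a minimal, hypothetical sketch. This is not the paper's implementation: the function names, the toy confidence model, and the fallback rule are all assumptions for illustration; the real method would score positions with a diffusion LM's predictive distribution.

```python
# Toy sketch of a certainty-aware sliding-window decoder in the spirit of
# Deferred Commitment Decoding (DCD). All names and the fallback heuristic
# are hypothetical; a real DLM would supply per-token confidences.
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for an undecoded (masked) token

def dcd_decode(
    seq_len: int,
    predict: Callable[[List[Optional[int]], int], Tuple[int, float]],
    window_size: int = 8,
    tau: float = 0.9,       # confidence threshold for committing a token
    max_steps: int = 1000,
) -> List[Optional[int]]:
    """Commit tokens in a sliding window only when their confidence exceeds
    `tau`; uncertain tokens stay masked until later passes, when more decoded
    context is available. The window advances only past committed positions,
    so no token is forced to commit at a fixed block boundary."""
    tokens: List[Optional[int]] = [MASK] * seq_len
    left = 0
    for _ in range(max_steps):
        if left >= seq_len:
            break
        right = min(left + window_size, seq_len)
        masked = [i for i in range(left, right) if tokens[i] is MASK]
        if not masked:
            left = right
            continue
        # Score every masked position in the window given current context.
        scored = [(i, *predict(tokens, i)) for i in masked]
        committed = False
        for i, tok, conf in scored:
            if conf >= tau:
                tokens[i] = tok
                committed = True
        if not committed:
            # Fallback so decoding always progresses: commit the single most
            # confident token, as in standard confidence-based decoding.
            i, tok, _ = max(scored, key=lambda t: t[2])
            tokens[i] = tok
        # Slide the window forward over the committed prefix.
        while left < seq_len and tokens[left] is not MASK:
            left += 1
    return tokens
```

With a toy `predict` that is confident only when a position's left neighbor is already decoded, the window naturally defers right-edge tokens until their context arrives, rather than forcing them to commit at a block boundary.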