Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the \emph{decoding strategy} -- which determines the order and number of tokens generated at each iteration -- critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
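As an illustration of the entropy sum-based rule described above, the following is a minimal sketch of one iteration's selection step. It is not the paper's algorithm statement. The names `entropy`, `select_positions`, `masked_probs`, and `threshold` are hypothetical. The sketch also assumes details the abstract does not specify: positions are visited in order of increasing entropy, and the position whose entropy pushes the running sum past the threshold is still unmasked.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(masked_probs, threshold):
    """Choose which masked positions to unmask in a single iteration.

    masked_probs: dict mapping each masked position to the model's
        predicted token distribution there (a list of probabilities).
    threshold: cumulative-entropy budget for this iteration.

    Assumptions (not taken from the abstract): positions are visited
    from lowest to highest entropy, and the position that makes the
    running sum exceed the threshold is included, so each iteration
    unmasks at least one token.
    """
    ranked = sorted(masked_probs, key=lambda i: entropy(masked_probs[i]))
    chosen, total = [], 0.0
    for pos in ranked:
        chosen.append(pos)
        total += entropy(masked_probs[pos])
        if total > threshold:
            break  # cumulative entropy exceeded the threshold: stop here
    return chosen


# Toy predictions for four masked positions (made-up numbers).
masked_probs = {
    0: [1.0, 0.0],                # fully confident
    1: [0.5, 0.5],                # entropy ln 2
    2: [0.9, 0.1],                # fairly confident
    3: [0.25, 0.25, 0.25, 0.25],  # entropy ln 4
}
print(select_positions(masked_probs, threshold=0.5))
```

Under these assumptions, confident (low-entropy) predictions let many positions be unmasked in one iteration, while uncertain predictions exhaust the budget quickly. This is the intuition behind an iteration count that scales with $H(X_0)$ rather than with the sequence length.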
