Diffusion language models have emerged as a promising approach for text generation. One would naturally expect them to be an efficient replacement for autoregressive models, since multiple tokens can be sampled in parallel during each diffusion step. However, their efficiency-accuracy trade-off is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness heavily depends on the target evaluation metric. Under mild conditions, we prove that when perplexity is the metric, MDMs can achieve near-optimal perplexity with a number of sampling steps that is independent of sequence length, demonstrating that efficiency can be achieved without sacrificing performance. However, when the metric is the sequence error rate, which matters for assessing the "correctness" of an entire sequence such as a reasoning chain, we show that the number of sampling steps must scale linearly with sequence length to obtain "correct" sequences, thereby eliminating the efficiency advantage of MDMs over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.
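To make the parallelism the abstract refers to concrete, below is a minimal sketch of an MDM-style reverse sampling loop in Python (PyTorch). It is not the paper's algorithm: `model`, `MASK_ID`, and the unmasking schedule are illustrative assumptions. The only point is that each of the `num_steps` iterations commits several tokens at once, so fewer steps means more tokens decided in parallel.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token (assumption)

@torch.no_grad()
def mdm_sample(model, seq_len, num_steps):
    """Sketch of a masked-diffusion reverse process: start fully masked,
    then at each step unmask a batch of positions in parallel using the
    model's per-position token distributions."""
    x = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(num_steps, 0, -1):
        masked = (x == MASK_ID).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        # reveal ~1/step of the remaining masks, so every position
        # has been committed after num_steps iterations
        k = max(1, masked.numel() // step)
        chosen = masked[torch.randperm(masked.numel())[:k]]
        logits = model(x)  # assumed shape: (seq_len, vocab_size)
        probs = torch.softmax(logits[chosen], dim=-1)
        x[chosen] = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return x
```

Setting `num_steps = seq_len` degenerates to committing roughly one token per step, i.e., autoregressive-like cost; the abstract's claims concern what happens when `num_steps` is much smaller than `seq_len`.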
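The two metrics can also be sketched formally. This is a hedged reconstruction using standard definitions, not necessarily the paper's exact formalization: $p$ is the data distribution over length-$L$ sequences, $q$ is the model's sampling distribution, and $\mathcal{C}$ is a hypothetical set of "correct" sequences (e.g., valid reasoning chains).

```latex
% Perplexity: per-token geometric-mean inverse likelihood of the
% model distribution q under data drawn from p
\mathrm{PPL}(q) = \exp\!\Big(-\tfrac{1}{L}\,\mathbb{E}_{x \sim p}\big[\log q(x)\big]\Big)

% Sequence error rate: probability that a generated sequence falls
% outside the set C of "correct" sequences
\mathrm{SER}(q) = \Pr_{x \sim q}\big[x \notin \mathcal{C}\big]
```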