Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their performance advantage over other discrete diffusion models, and they now rival auto-regressive models (ARMs) on language modeling tasks. Recent efforts to simplify the masked diffusion framework have further aligned it with continuous-space diffusion models and yielded more principled training and sampling recipes. In this paper, however, we show that both the training and the sampling of MDMs are theoretically free of the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. We establish the sampling-side connection through our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling, achieving a 20$\times$ speedup. In addition, our investigation casts doubt on whether MDMs can truly beat ARMs at text generation. We identify, for the first time, an underlying numerical issue, present even at the commonly used 32-bit floating-point precision, that makes categorical sampling inaccurate. We show, both theoretically and empirically, that this issue lowers the effective sampling temperature; the resulting loss of token diversity makes previous evaluations, which judge generation quality solely by the incomplete generative perplexity metric, somewhat unfair.
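The first-hitting idea can be sketched in a few lines. This is a hypothetical illustration only, assuming a linear masking schedule under which, with $n$ tokens still masked at time $t$, the time of the next unmasking event admits the closed-form draw $t \cdot u^{1/n}$ for $u \sim \mathrm{Uniform}(0,1)$; instead of simulating many small reverse-time steps, the sampler jumps directly from one unmasking event to the next. The function name and schedule below are illustrative, not taken verbatim from the paper's algorithm:

```python
import random


def first_hitting_times(num_tokens: int, seed: int = 0) -> list[float]:
    """Sample the successive unmasking times of all tokens, latest first.

    Hypothetical sketch: with n tokens still masked at time t, the next
    unmasking time is drawn as t * u**(1/n), u ~ Uniform(0, 1), assuming
    a linear masking schedule. Each draw replaces an entire sweep of
    per-step categorical unmasking decisions.
    """
    rng = random.Random(seed)
    t = 1.0  # reverse process starts at t = 1 with everything masked
    times = []
    for n in range(num_tokens, 0, -1):
        t *= rng.random() ** (1.0 / n)  # first-hitting time of the next unmask
        times.append(t)
    return times  # strictly decreasing times in (0, 1)
```

At each returned time one masked position would be unmasked by a single forward pass of the model, so the number of network calls and categorical draws scales with the sequence length rather than with the number of discretized time steps.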
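The kind of precision issue described can be illustrated with a minimal sketch, assuming the categorical sampling is implemented with the Gumbel-max trick (a common choice; the exact sampling kernel in MDM codebases may differ). Because a float32 uniform draw is quantized to multiples of $2^{-24}$, the Gumbel noise $-\log(-\log u)$ is capped well below what float64 allows, which truncates the noise tail and behaves like a lowered sampling temperature:

```python
import numpy as np

# Gumbel-max sampling picks argmax(logits - log(-log(u))), u ~ Uniform(0, 1).
# The largest representable uniform value below 1 caps the Gumbel noise, and
# the cap is far tighter under float32 than float64 uniforms.

def gumbel_cap(mantissa_bits: int) -> float:
    """Largest Gumbel value reachable when u is quantized to 1 - 2**-bits."""
    # log1p(-2**-bits) computes log(1 - 2**-bits) without catastrophic rounding
    return float(-np.log(-np.log1p(-(2.0 ** -mantissa_bits))))

cap32 = gumbel_cap(24)  # float32 uniforms: roughly 24 * ln(2) ~ 16.6
cap64 = gumbel_cap(53)  # float64 uniforms: roughly 53 * ln(2) ~ 36.7
```

Under float32 noise, a token whose logit trails the largest logit by more than roughly `cap32` plus the magnitude of the most negative competing noise can never win the argmax, so low-probability tokens are systematically undersampled, consistent with the effective-temperature drop described above.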