Despite their groundbreaking performance on many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this theory to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly into the construction of discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute for quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling strategies other than left-to-right prompting).