Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute for quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left-to-right prompting).
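For concreteness, a minimal sketch of the score entropy objective, with notation assumed rather than defined in this abstract: $p$ is the data distribution over a discrete space, $s_\theta(x)_y$ is the network's estimate of the ratio $p(y)/p(x)$ for $y \neq x$, $w_{xy} \geq 0$ are weights, and $K(a) := a(\log a - 1)$ is a term constant in $\theta$ that keeps the loss nonnegative:
$$\mathcal{L}_{\mathrm{SE}} = \mathbb{E}_{x \sim p}\left[\sum_{y \neq x} w_{xy}\left(s_\theta(x)_y - \frac{p(y)}{p(x)}\log s_\theta(x)_y + K\!\left(\frac{p(y)}{p(x)}\right)\right)\right].$$
Like the Fisher divergence underlying continuous score matching, each bracketed term is minimized, with value zero, exactly when $s_\theta(x)_y = p(y)/p(x)$, which is what makes score entropy a discrete analogue of score matching.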