Despite their groundbreaking performance on many generative modeling tasks, diffusion models have fallen short in discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this theory to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel discrete score matching loss that is more stable than existing methods, forms an ELBO for maximum likelihood training, and can be efficiently optimized with a denoising variant. We scale our Score Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, achieving highly competitive likelihoods while also introducing distinct algorithmic advantages. In particular, when comparing similarly sized SEDD and GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of the baseline, and sometimes outperforming it). Furthermore, SEDD models learn a more faithful sequence distribution (around $4\times$ better than GPT-2 models with ancestral sampling, as measured by large models), can trade off compute for generation quality (needing $16\times$ fewer network evaluations to match GPT-2), and enable arbitrary infilling beyond standard left-to-right prompting.