While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief, we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning Plaid's architecture with that of modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ relative to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We also benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art perplexity (PPL) bound of $22.1$ among continuous DLMs, along with superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights into the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the variance of the ELBO naturally yields a cross-entropy (information loss) that grows linearly over time. This distributes denoising difficulty evenly without any case-specific time reparameterization. In addition, we find that optimizing the embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
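The link between variance and a linear loss profile can be illustrated with a toy example. The sketch below is not RePlaid's training code: the loss density `loss_density`, the time change `warp`, and all other names are hypothetical choices made for illustration. It relies only on a general Monte Carlo fact: if an integral over uniformly sampled time is estimated from single-time samples, the estimator has zero variance over time exactly when the integrand is constant, that is, when the accumulated loss is linear in time.

```python
import random
import statistics

# Hypothetical per-time loss density on [0, 1] (an assumption, not from the paper).
# It integrates to 1 and concentrates most of the loss near s = 1.
def loss_density(s):
    return 3.0 * s ** 2

# Time change g(t) chosen so the accumulated loss, g(t)**3, equals t (linear in t).
def warp(t):
    return t ** (1.0 / 3.0)

# Derivative g'(t), needed for the change of variables in the integral.
def warp_derivative(t):
    return (1.0 / 3.0) * t ** (-2.0 / 3.0)

def estimate(n, warped, seed=0):
    """Return (mean, variance) of single-time samples of the total loss."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        t = 1.0 - rng.random()  # uniform on (0, 1]; avoids t = 0
        if warped:
            # Integrand after the time change: constant when accumulated loss is linear.
            samples.append(loss_density(warp(t)) * warp_derivative(t))
        else:
            samples.append(loss_density(t))
    return statistics.fmean(samples), statistics.pvariance(samples)

mean_uniform, var_uniform = estimate(20000, warped=False)
mean_warped, var_warped = estimate(20000, warped=True)
print("uniform time:", mean_uniform, var_uniform)
print("warped time: ", mean_warped, var_warped)
```

Both estimators target the same total loss, but under the warped time variable every sample carries the same value, so the spread across sampled times collapses to floating-point noise. In this toy the time change is fixed by hand; the abstract's claim concerns a learned noise schedule, which this sketch does not attempt to reproduce.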