Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation for many data modalities, such as images. In language modeling, however, prior continuous diffusion language models (DLMs) lag behind their discrete counterparts because of the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. LangFlow connects embedding-space DLMs to Flow Matching via Bregman divergence and adds three key innovations: (1) we derive a novel ODE-based negative log-likelihood (NLL) bound for principled evaluation of continuous flow-based language models; (2) we propose an information-uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self-conditioning, which we find improves both the likelihood and the sample quality of embedding-space DLMs, with effects substantially different from those in discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both perplexity (PPL) and generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero-shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow