Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation for many data modalities, such as images. In language modeling, however, prior continuous diffusion language models (DLMs) lag behind their discrete counterparts because of the sparse data space and an underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. LangFlow connects embedding-space DLMs to Flow Matching via Bregman divergence and adds three key innovations: (1) we derive a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) we propose an information-uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self-conditioning, which we find improves both the likelihood and the sample quality of embedding-space DLMs, with effects substantially different from those in discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both perplexity (PPL) and generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero-shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow