Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind their discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based negative log-likelihood (NLL) bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality. LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. Code: https://github.com/nealchen2003/LangFlow