Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.