Diffusion and flow-based models have become the de facto approaches for generating continuous data such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, such as classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
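To make the sampling picture concrete, the following is a minimal toy sketch, not the ELF implementation: it integrates a flow in embedding space with Euler steps, combines conditional and unconditional velocities with the standard CFG rule, and converts to tokens only once at the end. Everything named here is an assumption made for illustration: `toy_velocity` stands in for a trained network by using the closed-form velocity of a straight noise-to-target path, the "unconditional" branch simply points at the mean embedding, and the final dot-product readout against the embedding table is a placeholder for whatever shared-weight mapping the method actually uses.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale):
    # Standard classifier-free guidance: extrapolate from the
    # unconditional velocity toward the conditional one.
    return v_uncond + scale * (v_cond - v_uncond)

def toy_velocity(x, t, target):
    # Hypothetical stand-in for a learned network: the exact velocity of
    # the straight path x_t = (1 - t) * noise + t * target.
    return (target - x) / (1.0 - t)

def sample_tokens(token_ids, embed, steps=8, scale=1.5, seed=0):
    rng = np.random.default_rng(seed)
    cond_target = embed[token_ids]                    # (length, dim)
    # Assumption: the toy unconditional branch flows to the mean embedding.
    uncond_target = np.broadcast_to(embed.mean(axis=0), cond_target.shape)
    x = rng.standard_normal(cond_target.shape)        # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                                    # t stays strictly below 1
        v = guided_velocity(
            toy_velocity(x, t, cond_target),
            toy_velocity(x, t, uncond_target),
            scale,
        )
        x = x + dt * v                                # Euler step in embedding space
    # Discretize only at the end: score each position against the
    # embedding table and pick the best-matching token.
    logits = x @ embed.T
    return logits.argmax(axis=-1)

embed = np.eye(5)                                     # toy vocabulary of 5 one-hot embeddings
tokens = sample_tokens(np.array([3, 0, 4]), embed)
print(tokens)
```

The state `x` remains continuous for every step of the loop, and the `argmax` readout is the only place where discrete tokens appear. In a real model, both velocity calls would be forward passes of the same network with and without the conditioning signal.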