Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. In this work, we present the Latent Diffusion Language Model (LDLM), in which the latent encoder, diffusion model, and decoder are trained jointly. LDLM builds its latent space by reshaping the representations of a pre-trained language model with a trainable encoder, yielding latents that are easy to both denoise and decode into tokens. We show that naive joint training produces a low-quality diffusion model, and propose a simple training recipe consisting of an MSE decoder loss, diffusion-to-encoder warmup, adaptive timestep sampling, and decoder-input noise. Ablations show that each component substantially impacts generation performance. On OpenWebText and LM1B, LDLM achieves better generation performance than existing discrete and continuous diffusion language models while being $2$--$13\times$ faster, indicating that jointly learning the latent space is a key step toward making latent diffusion competitive for text generation.
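To make the joint objective concrete, the following is a minimal NumPy sketch of how three of the recipe's components (MSE decoder loss, adaptive timestep sampling, decoder-input noise) could fit into one forward pass. It is an illustration under stated assumptions, not the LDLM implementation: the linear encoder, denoiser, and decoder, the cosine noise schedule, the loss-proportional timestep sampler, the choice of MSE target, and every name (`sample_timesteps`, `joint_losses`, `W_enc`, `W_den`, `W_dec`, `loss_ema`) are hypothetical. The diffusion-to-encoder warmup is left out, since the abstract does not say how it is scheduled.

```python
import numpy as np

def sample_timesteps(loss_ema, batch_size, rng):
    """Adaptive timestep sampling (assumed form): pick timestep bins in
    proportion to their running loss, then jitter uniformly inside the bin."""
    probs = loss_ema / loss_ema.sum()
    bins = rng.choice(len(loss_ema), size=batch_size, p=probs)
    t = (bins + rng.random(batch_size)) / len(loss_ema)  # continuous t in [0, 1)
    return t, bins

def joint_losses(h, target, W_enc, W_den, W_dec, t, decoder_noise_std, rng):
    """Forward pass of a toy joint objective.

    h:      (batch, length, d_h) features from a frozen language model
    target: (batch, length, d_e) regression target for the decoder (assumed)
    t:      (batch,) diffusion times in [0, 1]
    Returns (diffusion_mse, decoder_mse).
    """
    z = h @ W_enc                                # trainable encoder reshapes LM features
    eps = rng.standard_normal(z.shape)
    a = np.cos(0.5 * np.pi * t)[:, None, None]   # signal scale (assumed cosine schedule)
    s = np.sin(0.5 * np.pi * t)[:, None, None]   # noise scale
    z_t = a * z + s * eps                        # every position is noised in parallel
    z_hat = z_t @ W_den                          # stand-in denoiser predicting clean z
    diffusion_mse = np.mean((z_hat - z) ** 2)

    # Decoder-input noise: the decoder sees perturbed latents during training.
    z_dec = z + decoder_noise_std * rng.standard_normal(z.shape)
    decoder_mse = np.mean((z_dec @ W_dec - target) ** 2)  # MSE decoder loss
    return diffusion_mse, decoder_mse

# Usage with toy shapes; all sizes and constants are illustrative.
rng = np.random.default_rng(0)
B, L, D_h, D_z, D_e = 4, 8, 16, 6, 16
h = rng.standard_normal((B, L, D_h))
target = rng.standard_normal((B, L, D_e))
W_enc = 0.1 * rng.standard_normal((D_h, D_z))
W_den = np.eye(D_z)
W_dec = 0.1 * rng.standard_normal((D_z, D_e))

loss_ema = np.ones(10)                           # one running loss per timestep bin
t, bins = sample_timesteps(loss_ema, B, rng)
diff_loss, dec_loss = joint_losses(h, target, W_enc, W_den, W_dec, t, 0.1, rng)
loss_ema[bins] = 0.9 * loss_ema[bins] + 0.1 * diff_loss  # harder bins get sampled more
total_loss = diff_loss + dec_loss
```

In an actual training loop these two losses would be minimized with an autograd framework, and gradients from both terms would reach the encoder, which is what makes the latent space jointly learned rather than fixed in advance.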