Diffusion language models, especially masked discrete diffusion models, have recently achieved great success. Although theoretical and preliminary empirical results show the advantages of latent reasoning with looped transformers or continuous chain-of-thought, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to operate in the discrete space. In particular, we prove that continuous diffusion models are more expressive than discrete diffusion models and looped transformers. We attribute the gap between their theoretical expressivity and empirical performance to practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, it introduces the additional difficulty of decoding tokens from the continuous representation space into the discrete token space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, using a single model to denoise simultaneously in the joint space. By combining the two modalities, CCDD gains expressivity from the rich semantics of the latent space, while the explicit discrete tokens provide good trainability and sample quality. We also propose effective architectures and advanced training and sampling techniques for CCDD, which yield strong empirical performance in extensive language modeling experiments on real-world tasks.