While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/mdlm
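The "mixture of classical masked language modeling losses" can be made concrete with a small sketch. The following is a minimal illustration, not the paper's implementation: it assumes a linear noise schedule alpha_t = 1 - t, under which the continuous-time weight alpha_t' / (1 - alpha_t) reduces to -1/t, so one Monte Carlo sample of the objective is a 1/t-weighted masked cross-entropy. The function `log_probs_fn` is a hypothetical stand-in for the encoder-only denoising model.

```python
import math
import random

def mdlm_loss_sketch(tokens, log_probs_fn, t, mask_id=-1):
    """One Monte Carlo sample of a masked-diffusion objective (sketch).

    With a linear schedule alpha_t = 1 - t, the loss weight
    alpha_t' / (1 - alpha_t) is -1/t, so the (negative) bound is a
    1/t-weighted masked-LM cross-entropy. `log_probs_fn` is a
    hypothetical stand-in: it maps a corrupted sequence to
    per-position log-probabilities over the vocabulary.
    """
    # Forward process: mask each token independently with probability t.
    corrupted = [mask_id if random.random() < t else tok for tok in tokens]
    log_probs = log_probs_fn(corrupted)
    # Cross-entropy only on masked positions; unmasked tokens are
    # "carried over" and contribute zero loss.
    ce = sum(-log_probs[i][tok]
             for i, tok in enumerate(tokens)
             if corrupted[i] == mask_id)
    return ce / t  # 1/t weighting from the linear schedule

# Toy usage: a "model" that predicts a uniform distribution over 4 tokens.
vocab = 4
uniform = lambda seq: [[-math.log(vocab)] * vocab for _ in seq]
loss = mdlm_loss_sketch([0, 1, 2, 3], uniform, t=0.5)
```

Averaging this quantity over random draws of t in (0, 1] yields the mixture-of-masking-rates objective described above; each draw is just a standard masked language modeling loss at a different masking ratio.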