Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and their effectiveness on core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to that of autoregressive models (ARMs) and a relatively small compute gap. Motivated by this scalability, we train a family of MDMs with up to 1.1 billion (B) parameters and systematically evaluate them against ARMs of comparable or larger size. Leveraging the probabilistic formulation of MDMs, we propose a simple unsupervised classifier-free guidance that exploits large-scale unpaired data and boosts performance on conditional inference. In language understanding, the 1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data on four of eight zero-shot benchmarks. Notably, its math reasoning ability on the GSM8K dataset is competitive with that of the 7B Llama-2 model. In text generation, MDMs given 16 times more pre-training time offer a flexible trade-off against ARMs that use the KV-Cache accelerated sampling technique: they match ARMs in performance while sampling 1.4 times faster. Moreover, MDMs address tasks that are challenging for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. In particular, a 1.1B MDM breaks the reversal curse encountered by much larger ARMs trained with significantly more data and computation, such as 13B Llama-2 and 175B GPT-3. Our code is available at https://github.com/ML-GSAI/SMDM.