Scaling up Masked Diffusion Models on Text

Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to that of autoregressive models (ARMs) and a relatively small compute gap. Motivated by this scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, the 1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data on four of eight zero-shot benchmarks. Notably, its math reasoning ability on the GSM8K dataset is competitive with that of the 7B Llama-2 model. In text generation, MDMs given 16 times more pre-training time offer a flexible trade-off against ARMs that use the accelerated sampling technique KV-Cache: MDMs match ARMs in performance while being 1.4 times faster during sampling. Moreover, MDMs address tasks that are challenging for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reversal curse encountered by much larger ARMs trained with significantly more data and computation, such as 13B Llama-2 and 175B GPT-3. Our code is available at https://github.com/ML-GSAI/SMDM.
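
The abstract does not spell out the form of the unsupervised classifier-free guidance, so the following is only a minimal sketch under stated assumptions. It assumes the usual classifier-free guidance combination, log p̃ = (1 + w) log p_cond − w log p_uncond, applied to the predicted token distribution at a single masked position, and it assumes that the "unconditional" prediction comes from the same model run without the prompt (for example, with the prompt replaced by mask tokens). The function names, the scale `w`, and the toy logits are hypothetical and are not taken from the authors' code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def guided_log_probs(cond_logits, uncond_logits, w):
    """Assumed classifier-free guidance rule for one masked position.

    cond_logits:   model output when the prompt is visible.
    uncond_logits: model output when the prompt is hidden (assumption:
                   e.g. replaced by mask tokens, so no paired data is needed).
    w:             guidance scale; w = 0 recovers the conditional prediction.
    """
    log_p_cond = log_softmax(cond_logits)
    log_p_uncond = log_softmax(uncond_logits)
    mixed = [(1.0 + w) * c - w * u for c, u in zip(log_p_cond, log_p_uncond)]
    # Renormalize so the result is again a valid log-distribution.
    return log_softmax(mixed)


# Toy vocabulary of three tokens (illustrative values only).
cond = [2.0, 0.0, 0.0]
uncond = [0.0, 0.0, 0.0]
plain = [math.exp(v) for v in guided_log_probs(cond, uncond, 0.0)]
guided = [math.exp(v) for v in guided_log_probs(cond, uncond, 1.0)]
```

In this toy case, raising `w` above zero shifts probability mass toward the token that the prompt makes more likely relative to the prompt-free prediction, which is the intended effect of guidance.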

