The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic models and the scalability of large language models. Despite this potential, it remains unclear whether diffusion language models can solve general language tasks at a level comparable to their autoregressive counterparts. This paper demonstrates that scaling diffusion models with respect to data, model size, and tasks can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining, which is possible because of the intrinsic connection between masked language models and diffusion language models. We then reprogram the pretrained masked language models into diffusion language models via diffusive adaptation, in which we explore task-specific finetuning and instruction finetuning to unlock their versatility in solving general language tasks. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks. We further find that instruction finetuning can elicit zero-shot and few-shot in-context learning abilities, which help the models tackle many unseen tasks by following natural language instructions, and that the resulting models show promise in advanced and challenging abilities such as reasoning.