In this paper, we introduce a novel dIffusion language modEl pre-training framework for text generation, which we call GENIE. GENIE is a large-scale pre-trained diffusion language model consisting of an encoder and a diffusion-based decoder; it generates text by gradually transforming a random noise sequence into a coherent text sequence. To pre-train GENIE on a large-scale language corpus, we design a new continuous paragraph denoise objective, which encourages the diffusion decoder to reconstruct a clean text paragraph from a corrupted version while preserving semantic and syntactic coherence. We evaluate GENIE on four downstream text generation benchmarks: XSum, CNN/DailyMail, Gigaword, and CommonGen. Our experimental results show that GENIE achieves performance comparable to state-of-the-art autoregressive models on these benchmarks and generates more diverse text samples. The code and models of GENIE are available at https://github.com/microsoft/ProphetNet/tree/master/GENIE.
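To make the idea of corrupting a paragraph and reconstructing it more concrete, the following is a minimal sketch of a Gaussian forward (noising) step and a reconstruction loss on toy paragraph embeddings. It is not GENIE's implementation: the linear noise schedule, the mean squared error loss, the toy dimensions, and the names `linear_alpha_bar`, `add_noise`, and `denoise_loss` are all assumptions chosen for illustration, and the trained denoising network itself is omitted.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions, not values from the paper.
    """
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x0]


def denoise_loss(x0, x0_pred):
    """Mean squared error between clean and reconstructed embeddings."""
    count = sum(len(row) for row in x0)
    total = sum(
        (a - b) ** 2
        for clean_row, pred_row in zip(x0, x0_pred)
        for a, b in zip(clean_row, pred_row)
    )
    return total / count


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)

# Toy "paragraph": 4 token embeddings of dimension 8 (hypothetical sizes).
paragraph = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

x_early = add_noise(paragraph, alpha_bar[0], rng)   # lightly corrupted
x_late = add_noise(paragraph, alpha_bar[-1], rng)   # close to pure noise

# Using the noisy sample as a stand-in "prediction" shows how far each
# corruption level is from the clean paragraph.
loss_early = denoise_loss(paragraph, x_early)
loss_late = denoise_loss(paragraph, x_late)
```

In this sketch, a learned denoiser would take a noisy sample such as `x_late` (together with the encoder output) and be trained to produce a prediction that drives `denoise_loss` toward zero; generation would then start from pure noise and apply the denoiser repeatedly.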