Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training that introduces an auxiliary discriminator, unifying language understanding and generation in a single model. Our model, named GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution, and the discriminator predicts whether the target tokens sampled from that distribution are incorrect. The misclassified tokens then replace the corresponding tokens in the target sentence to construct a noisy previous context, which is used to generate the gold sentence. Together, the two tasks improve both language understanding and generation by selectively using the denoising data. Extensive experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
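The data flow described in the abstract can be illustrated with a small, model-free sketch. It is not the authors' implementation: the function names, the dictionary-based distributions, and the hand-written discriminator predictions are all hypothetical stand-ins for neural components. It also assumes one reading of "misclassified tokens", namely sampled tokens for which the discriminator's prediction disagrees with the true replaced/original label.

```python
import random


def sample_tokens(dist_per_pos, rng):
    """Sample one token per masked position from a toy generator distribution.

    dist_per_pos maps a target position to {token: probability}; in a real
    model this would come from the generator's softmax output (assumption).
    """
    sampled = {}
    for pos, dist in dist_per_pos.items():
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        sampled[pos] = rng.choices(tokens, weights=weights, k=1)[0]
    return sampled


def detection_labels(gold, sampled):
    """Replaced token detection targets: 1 if the sampled token differs from gold."""
    return {pos: int(tok != gold[pos]) for pos, tok in sampled.items()}


def build_noisy_context(gold, sampled, labels, disc_pred):
    """Build the noisy previous context for replaced token denoising.

    A sampled token is written into the target only where the discriminator's
    prediction disagrees with the true label (our assumed reading of
    "misclassified"); every other position keeps the gold token.
    """
    noisy = list(gold)
    for pos, tok in sampled.items():
        if disc_pred[pos] != labels[pos]:
            noisy[pos] = tok
    return noisy


# Toy example: positions 1 and 5 of the target were masked in the source.
gold = ["the", "cat", "sat", "on", "the", "mat"]
sampled = {1: "dog", 5: "mat"}            # hypothetical generator samples
labels = detection_labels(gold, sampled)  # position 1 replaced, position 5 not
disc_pred = {1: 0, 5: 0}                  # hypothetical discriminator output
noisy = build_noisy_context(gold, sampled, labels, disc_pred)
print(" ".join(noisy))
```

Here the discriminator fails to flag "dog" as a replacement, so "dog" enters the noisy context and the decoder would be trained to recover the gold sentence from it. Had the discriminator caught the replacement, the context would stay identical to the gold target.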