Transformer-based autoregressive (AR) methods have achieved strong performance on a variety of sequence-to-sequence generation tasks, e.g., neural machine translation, summarization, and code generation, but suffer from low inference efficiency. To speed up inference, many non-autoregressive (NAR) strategies have been proposed in the past few years. Among them, the conditional masked language model (CMLM) is one of the most versatile frameworks, as it supports many different sequence generation scenarios and achieves very competitive performance on these tasks. In this paper, we introduce a simple yet effective adaptive masking-over-masking strategy that enhances the refinement capability of the decoder and makes the encoder easier to optimize. Experiments on \textbf{3} different tasks (neural machine translation, summarization, and code generation) with \textbf{15} datasets in total confirm that our method achieves significant performance improvements over the strong CMLM baseline. Surprisingly, our model yields state-of-the-art performance on neural machine translation (\textbf{34.62} BLEU on WMT16 EN$\to$RO, \textbf{34.82} BLEU on WMT16 RO$\to$EN, and \textbf{34.84} BLEU on IWSLT De$\to$En) and even outperforms the \textbf{AR} Transformer on \textbf{7} benchmark datasets with at least a \textbf{2.2$\times$} speedup. Our code is available on GitHub.