Non-autoregressive machine translation (NAT) models have lower translation quality than autoregressive translation (AT) models because NAT decoders do not condition on previous target tokens in the decoder input. We propose a novel and general Dependency-Aware Decoder (DePA) that enhances target dependency modeling in the decoder of fully NAT models from two perspectives: decoder self-attention and decoder input. First, we propose an autoregressive forward-backward pre-training phase before NAT training, which enables the NAT decoder to gradually learn bidirectional target dependencies for the final NAT training. Second, we transform the decoder input from the source language representation space to the target language representation space through a novel attentive transformation process, which enables the decoder to better capture target dependencies. DePA can be applied to any fully NAT model. Extensive experiments show that DePA consistently improves highly competitive and state-of-the-art fully NAT models on the widely used WMT and IWSLT benchmarks by up to 1.88 BLEU, while keeping inference latency comparable to that of other fully NAT models.
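The attentive transformation of the decoder input can be illustrated with a minimal sketch. The abstract does not specify the mechanism, so the sketch assumes scaled dot-product attention in which each decoder input vector (a source-side representation) acts as a query over a table of target-side embeddings, so that the output is a weighted mix of target embeddings. The names `attentive_transform`, `decoder_inputs`, and `target_embeddings` are hypothetical, and a real model would likely use learned projections and batched tensor operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_transform(decoder_inputs, target_embeddings):
    """Re-express each decoder input as a convex combination of target embeddings.

    Assumption (not confirmed by the abstract): plain scaled dot-product
    attention with no learned projections.
    decoder_inputs:    list of source-space vectors, one per decoder position
    target_embeddings: list of target-side embedding vectors (keys and values)
    """
    dim = len(target_embeddings[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in decoder_inputs:
        # Similarity of this position to every target embedding.
        scores = [
            sum(q * k for q, k in zip(query, emb)) / scale
            for emb in target_embeddings
        ]
        weights = softmax(scores)
        # Weighted sum of target embeddings: the vector now lies in target space.
        outputs.append([
            sum(w * emb[j] for w, emb in zip(weights, target_embeddings))
            for j in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    toy_targets = [[1.0, 0.0], [0.0, 1.0]]   # two toy target embeddings
    toy_inputs = [[10.0, 0.0], [0.0, 0.0]]   # two decoder positions
    for vector in attentive_transform(toy_inputs, toy_targets):
        print(vector)
```

Because every output is a convex combination of target embeddings, the decoder receives inputs drawn from the target representation space rather than the source space, which is the property the abstract attributes to the transformation.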