Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting all words at once rather than sequentially. They have achieved remarkable progress in machine translation and many other applications. However, a long-standing challenge for NATs is learning multi-modal data distributions, which is the main cause of the performance gap between NATs and ATs. In this paper, we propose to ease the difficulty of modality learning by sampling from the model distribution instead of the data distribution. We derive contrastive constraints to stabilize the training process and integrate the resulting objective with the state-of-the-art NAT architecture DA-Transformer. Our model is evaluated on three tasks (machine translation, text summarization, and paraphrasing) across five benchmarks. Results show that our approach outperforms previous non-autoregressive baselines by a significant margin and establishes new state-of-the-art results for non-autoregressive Transformers on all the benchmarks.