The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.