In recent years, the Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the Transformer's self-attention mechanism at the cost of performance. However, performance is key to the Transformer's continued success. This paper proposes the Extractor, a drop-in replacement for the self-attention mechanism in the Transformer. Experimental results show that replacing the self-attention mechanism with the Extractor improves the performance of the Transformer. Furthermore, the Extractor has the potential to run faster than self-attention because its critical path of computation is much shorter. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.