In recent years, the Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works reduce the computational and memory complexity of the Transformer's self-attention mechanism at the cost of performance. However, performance is key to the continuing success of the Transformer. This paper proposes a family of drop-in replacements for the self-attention mechanism, called Extractors. Four types are presented as examples: the super high-performance Extractor (SHE), the higher-performance Extractor (HE), the worthwhile Extractor (WE), and the minimalist Extractor (ME). Experimental results show that replacing the self-attention mechanism with the SHE clearly improves the performance of the Transformer, whereas the simplified versions of the SHE (the HE, the WE, and the ME) perform close to or better than the self-attention mechanism with lower computational and memory complexity. Furthermore, the proposed Extractors can, or have the potential to, run faster than the self-attention mechanism because their critical paths of computation are much shorter. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.
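The Markov-chain view of text generation mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's formulation, the Transformer, or any Extractor; it is a count-based next-token model in which the conditioning context grows with the prefix up to a cap, assumed here only to show what "variable-length" conditioning means. The function names (`fit_counts`, `next_token_distribution`, `generate`) and the `max_context` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_counts(tokens, max_context):
    """Count next-token occurrences for every context of length 0..max_context."""
    counts = defaultdict(Counter)
    for i, nxt in enumerate(tokens):
        # Early positions have short contexts; later ones are capped at max_context.
        for k in range(0, min(i, max_context) + 1):
            context = tuple(tokens[i - k:i])
            counts[context][nxt] += 1
    return counts


def next_token_distribution(counts, prefix, max_context):
    """Use the longest previously seen suffix of the prefix, backing off to shorter ones."""
    for k in range(min(len(prefix), max_context), -1, -1):
        context = tuple(prefix[len(prefix) - k:])
        if context in counts:
            total = sum(counts[context].values())
            return {tok: c / total for tok, c in counts[context].items()}
    return {}  # only reached when the model was fit on an empty sequence


def generate(counts, prompt, steps, max_context, seed=0):
    """Sample tokens one at a time, each conditioned on the current prefix."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(steps):
        dist = next_token_distribution(counts, out, max_context)
        if not dist:
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs, k=1)[0])
    return out


if __name__ == "__main__":
    corpus = "a b a c a b".split()
    model = fit_counts(corpus, max_context=2)
    print(next_token_distribution(model, ["a"], max_context=2))
    print(generate(model, ["a"], steps=5, max_context=2))
```

In this toy setting, a learned sequence model would play the role of `next_token_distribution`, replacing the count table with a parameterized function of the prefix.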