The Transformer architecture consists of self-attention and feed-forward networks (FFNs), which can be viewed as key-value memories according to previous work. However, FFNs and traditional key-value memory use different activation functions (ReLU and Softmax, respectively), so the two are not equivalent. In this paper, we first re-establish the connection between FFN and key-value memory through extensive studies of ReLU and Softmax, and find that they become equivalent when an additional layer normalization module is applied to Softmax. In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large. We analyze the reasons and then explore this property of ReLU in the self-attention network, where the original Softmax activation performs poorly on long input sequences. We then propose a fully ReLU-based architecture named ReLUFormer, which performs better than the baseline Transformer on long-sequence tasks such as document translation. This paper sheds light on the following points: 1) Softmax and ReLU normalize over elements in different ways, which leads to different variances in their outputs, and ReLU handles a large number of key-value slots well; 2) FFN and key-value memory are equivalent, and thus the Transformer can be viewed as a memory network in which FFNs and self-attention networks are both key-value memories.
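To make the comparison between the two activations concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the usual key-value reading in which a memory computes `activation(x K^T) V`; the function names, the dimensions, the random Gaussian keys and values, and the parameter-free layer normalization are all illustrative assumptions. It reads the same slots once with ReLU (FFN style) and once with Softmax (traditional memory style), then applies layer normalization to the Softmax output so that its scale can be compared.

```python
import numpy as np


def relu_memory(x, keys, values):
    """FFN-style read: ReLU(x K^T) V, with no normalization over slots."""
    scores = x @ keys.T
    return np.maximum(scores, 0.0) @ values


def softmax_memory(x, keys, values):
    """Traditional key-value memory read: Softmax(x K^T) V."""
    scores = x @ keys.T
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values


def layer_norm(h, eps=1e-5):
    """Layer normalization over the feature axis, without learned gain/bias."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, n_slots = 16, 1024
x = rng.standard_normal((4, d_model))
keys = rng.standard_normal((n_slots, d_model)) / np.sqrt(d_model)
values = rng.standard_normal((n_slots, d_model))

relu_out = relu_memory(x, keys, values)
soft_out = softmax_memory(x, keys, values)
soft_ln_out = layer_norm(soft_out)

# Softmax weights sum to one over the slots, so its output has a much
# smaller scale than the ReLU output; layer normalization rescales it.
print("ReLU output variance:        ", relu_out.var())
print("Softmax output variance:     ", soft_out.var())
print("Softmax + LayerNorm variance:", soft_ln_out.var())
```

In this toy setting the Softmax weights sum to one across all slots, so the Softmax output variance shrinks as `n_slots` grows, while the ReLU output is not normalized over slots. This is only an illustration of the variance argument in point 1), under the stated assumptions, and says nothing about task performance.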