Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors onto a $(d-1)$-dimensional space that is orthogonal to the $\left[1,1,...,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, so that the attention layer does not need to learn this operation itself; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayerNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role.
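The two-component view described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes LayerNorm without the learned gain and bias and without the usual epsilon term, and the function names `layer_norm`, `project_then_scale`, and `softmax` are hypothetical helpers introduced only for illustration. The last part also assumes, as a simplification, that queries and keys are used directly with no learned query/key matrices, which is not the full attention setup.

```python
import math

def layer_norm(x):
    """Plain LayerNorm: subtract the mean, divide by the (population) std.
    Assumption: no learned gain/bias and no epsilon."""
    d = len(x)
    mean = sum(x) / d
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / d)
    return [(v - mean) / std for v in x]

def project_then_scale(x):
    """Geometric view: (a) project, then (b) scale to norm sqrt(d)."""
    d = len(x)
    # (a) Remove the component of x along [1, 1, ..., 1]:
    #     x - ((x . 1) / (1 . 1)) * 1, where (x . 1) / d is the mean.
    coeff = sum(x) / d
    projected = [v - coeff for v in x]
    # (b) Rescale the projected vector so its Euclidean norm is sqrt(d).
    norm = math.sqrt(sum(v * v for v in projected))
    return [v * math.sqrt(d) / norm for v in projected]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, -1.0, 0.5, 4.0]
a = layer_norm(x)
b = project_then_scale(x)

# Both routes give the same vector, orthogonal to [1,...,1], with norm sqrt(d).
max_diff = max(abs(p - q) for p, q in zip(a, b))
dot_with_ones = sum(b)
out_norm = math.sqrt(sum(v * v for v in b))

# Simplified illustration of property (a): a query along [1,...,1] has zero
# dot product with every normalized key, so softmax weights are uniform.
keys = [layer_norm(k) for k in ([3.0, 0.0, -2.0, 1.0],
                                [0.1, 0.2, 0.3, 9.0],
                                [-5.0, 5.0, 1.0, 1.0])]
query = [1.0, 1.0, 1.0, 1.0]
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
weights = softmax(scores)

print(max_diff, dot_with_ones, out_norm)
print(weights)
```

In this sketch the division by the standard deviation and the rescaling to norm $\sqrt{d}$ coincide because the norm of the mean-centered vector equals $\sqrt{d}$ times its standard deviation.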