Layer Normalization (LayerNorm) is an inherent component of all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors onto a $(d-1)$-dimensional subspace that is orthogonal to the $\left[1,1,...,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create a query that attends to all keys equally, so that the attention layer does not need to learn this operation itself; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayerNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .
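The two-component decomposition can be checked numerically. The following is a minimal NumPy sketch and is not taken from the authors' repository: the function names `layer_norm` and `project_then_scale` are hypothetical, and the learned gain and bias and the numerical-stability epsilon of a practical LayerNorm are omitted by assumption. The last lines illustrate property (a) in a simplified setting that ignores the learned query and key projections: a query along $\left[1,1,...,1\right]$ has zero dot product with every normalized vector, so the softmax over such scores is uniform.

```python
import numpy as np

def layer_norm(x):
    """LayerNorm over the last axis, without learned gain/bias or epsilon (assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    return (x - mu) / np.sqrt(var)

def project_then_scale(x):
    """Geometric view: (a) project orthogonally to [1,...,1], (b) rescale to norm sqrt(d)."""
    d = x.shape[-1]
    ones = np.ones(d)
    # (a) subtract the component of each vector along the all-ones direction
    proj = x - (x @ ones)[..., None] / d * ones
    # (b) give every projected vector the same norm, sqrt(d)
    return np.sqrt(d) * proj / np.linalg.norm(proj, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 illustrative vectors with d = 8
y = project_then_scale(x)

same_as_layer_norm = np.allclose(y, layer_norm(x))
norms = np.linalg.norm(y, axis=-1)   # each entry should equal sqrt(8)

# Simplified illustration of (a): a query along [1,...,1] scores all keys equally.
q = 3.0 * np.ones(8)
scores = y @ q                       # all (numerically) zero
weights = np.exp(scores) / np.exp(scores).sum()
```

Under these assumptions, `same_as_layer_norm` is true because the population variance equals the squared norm of the mean-centered vector divided by $d$, so dividing by the standard deviation is the same as rescaling to norm $\sqrt{d}$.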