Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and the effective utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, by constructing nontrivial counterexamples, we then establish that with a proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria of any rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention, and suggests that self-attention with LayerNorm constitutes a much more expressive and versatile nonlinear dynamical system than previously thought.
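The rank collapse phenomenon for pure attention can be observed numerically. The following is a minimal sketch, not taken from the paper: it iterates the pure softmax self-attention update (value matrix fixed to the identity, no LayerNorm, no skip connections, with hypothetical dimensions and random query/key weights) and tracks the distance of the token matrix from the rank one subspace where all token rows coincide.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def residual_ratio(X):
    # Relative distance of X from the rank one subspace in which
    # every token (row) equals the mean token.
    mean_row = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - mean_row) / np.linalg.norm(X)

rng = np.random.default_rng(0)
n, d, depth = 8, 16, 40          # tokens, width, layers (illustrative)
X = rng.standard_normal((n, d))
WQ = rng.standard_normal((d, d)) / np.sqrt(d)
WK = rng.standard_normal((d, d)) / np.sqrt(d)

ratios = [residual_ratio(X)]
for _ in range(depth):
    scores = (X @ WQ) @ (X @ WK).T / np.sqrt(d)
    A = softmax(scores)          # row-stochastic attention matrix
    X = A @ X                    # pure attention update, W_V = I
    ratios.append(residual_ratio(X))

print(f"initial residual ratio: {ratios[0]:.3f}")
print(f"final residual ratio:   {ratios[-1]:.3e}")
```

Because each attention matrix is row-stochastic with strictly positive entries, every layer pulls the token rows toward their mean, and the residual ratio shrinks toward zero with depth, which is the collapse to a rank one subspace discussed above.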