Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections, and MLP blocks. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward pass and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework offers a unified perspective on the two failure modes of self-attention and gives quantitative predictions for the scale of both weights and residual connections that guarantees smooth training.
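To make the two failure modes concrete, here is a minimal numerical sketch (our own illustration, not the paper's algorithm for trainability diagrams): random tokens are pushed through stacked softmax self-attention layers at initialisation, tracking the mean pairwise cosine similarity of token representations (approaching 1 signals rank collapse) and the mean entropy of attention rows (approaching 0 signals entropy collapse). The weight scale `sigma_w`, the depth, and the omission of skip connections, layer normalisation, and MLP blocks are illustrative assumptions, chosen to expose the worst case.

```python
# Minimal sketch of the two self-attention failure modes at initialisation.
# Assumptions (not the paper's setup): no skip connections, no layer norm,
# no MLP; `sigma_w` is an illustrative weight-scale knob.
import numpy as np

rng = np.random.default_rng(0)
T, d, depth, sigma_w = 32, 64, 20, 1.0  # tokens, width, layers, weight std


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


X = rng.normal(size=(T, d))  # random input tokens
for layer in range(depth):
    # Fresh Gaussian query/key/value weights at scale sigma_w / sqrt(d)
    Wq, Wk, Wv = (sigma_w / np.sqrt(d) * rng.normal(size=(d, d)) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # (T, T) attention matrix
    X = A @ (X @ Wv)

    # Diagnostics for the two failure modes
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos_sim = (Xn @ Xn.T)[np.triu_indices(T, k=1)].mean()  # -> 1: rank collapse
    entropy = -(A * np.log(A + 1e-12)).sum(axis=1).mean()  # -> 0: entropy collapse
    print(f"layer {layer + 1:2d}: mean cos sim {cos_sim:.3f}, attn entropy {entropy:.3f}")
```

Varying `sigma_w` in this sketch moves the dynamics between the two regimes: small logit scales drive near-uniform attention that averages tokens together (rank collapse), while large logit scales concentrate each attention row on a few tokens (entropy collapse), which is the qualitative picture the abstract attributes to the choice of initialisation hyper-parameters.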