In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
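As a rough illustration of the two modifications described above, the following NumPy sketch builds one attention matrix whose Softmax output is centered at the identity and whose logits are divided by a width-dependent temperature. The abstract does not give the exact formulas, so several choices here are assumptions: the centering is implemented by subtracting the uniform matrix and adding the identity, the temperature is taken as `tau0 * sqrt(n * n_k)`, the weights are Gaussian with variance `1/n`, and the names `shaped_attention`, `tau0`, `W_q`, and `W_k` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def shaped_attention(X, W_q, W_k, tau0=1.0):
    """Sketch of an attention matrix centered at the identity.

    X   : (m, n) array of m token representations of width n
    W_q : (n, n_k) query weights
    W_k : (n, n_k) key weights
    """
    m, n = X.shape
    n_k = W_q.shape[1]
    # Assumption: the temperature grows with the width as tau0 * sqrt(n * n_k).
    tau = tau0 * np.sqrt(n * n_k)
    logits = (X @ W_q) @ (X @ W_k).T / tau
    # Assumption: "centering at identity" means replacing the uniform
    # component of the softmax (1/m in every entry) with the identity matrix.
    return np.eye(m) + softmax(logits, axis=-1) - np.full((m, m), 1.0 / m)


rng = np.random.default_rng(0)
m, n, n_k = 8, 256, 256
X = rng.standard_normal((m, n))
# Assumption: Gaussian initialization with variance 1/n.
W_q = rng.standard_normal((n, n_k)) / np.sqrt(n)
W_k = rng.standard_normal((n, n_k)) / np.sqrt(n)

A = shaped_attention(X, W_q, W_k)
deviation = np.abs(A - np.eye(m)).max()
print("max |A - I| =", deviation)
```

Under these assumptions the logits shrink as the width grows, so the Softmax stays close to uniform and the attention matrix stays close to the identity. This is the kind of behavior that keeps each layer a small perturbation of a skip connection.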