We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework for understanding the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing modified dynamics that involve the MLP layers only. JoMA removes unrealistic assumptions made in previous analyses (e.g., the lack of residual connections) and predicts that, in the presence of nonlinear activations, attention first becomes sparse (to learn salient tokens) and then dense (to learn less salient tokens); in the linear case, it is consistent with existing works showing that attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained on real-world datasets (Wikitext2/Wikitext103) and on various pre-trained models (OPT, Pythia) verify our theoretical findings. Code can be found at https://github.com/facebookresearch/luckmatters/tree/yuandong3.
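The sparse-then-dense prediction concerns how concentrated each attention distribution is. One common way to quantify this is the Shannon entropy of an attention row: low entropy means the weight sits on a few tokens (sparse), high entropy means it is spread out (dense). The sketch below is illustrative only. The entropy metric, the helper names `softmax` and `attention_entropy`, and the toy logits are assumptions made here, not the measurement procedure or code from the JoMA repository.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each attention row.

    Lower values indicate sparser attention; the maximum, log(n),
    is reached by uniform attention over n tokens.
    """
    attn = np.asarray(attn, dtype=float)
    return -(attn * np.log(attn + eps)).sum(axis=-1)

# Hypothetical attention logits over 4 context tokens.
uniform = softmax(np.zeros(4))                    # dense: equal weight everywhere
peaked = softmax(np.array([8.0, 0.0, 0.0, 0.0]))  # sparse: one dominant token

print("dense entropy :", attention_entropy(uniform))
print("sparse entropy:", attention_entropy(peaked))
```

Tracking this quantity per layer across training checkpoints would be one way to look for an initial drop in entropy followed by a later rise, under the assumption that entropy is an adequate proxy for sparsity.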