The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. We prove that, as the parameters grow in magnitude, the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family, and this capacity can be described in terms of formal languages and automata. Our results suggest that saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some heads focus locally on a small number of positions, while others compute global averages, which allows counting. We believe that understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
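To make the saturation claim concrete, the following is a minimal sketch and not the paper's code. The names `softmax` and `saturated_attention` and the toy `scores` vector are hypothetical, chosen only for illustration. It assumes, as a simplification, that growing parameter norm acts like multiplying the attention scores by a growing constant `c`. Under that assumption, softmax attention approaches a uniform distribution over the highest-scoring positions: a single maximum gives local focus on one position, while tied maxima give an average over the tied positions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def saturated_attention(scores):
    # Assumed limiting behavior as the scale goes to infinity:
    # uniform weight over the positions that attain the maximum score.
    mask = (scores == scores.max()).astype(float)
    return mask / mask.sum()

# Toy attention scores (illustrative values); positions 0 and 2 tie for the maximum.
scores = np.array([2.0, 0.5, 2.0, -1.0])

for c in (1.0, 10.0, 100.0):
    weights = softmax(c * scores)
    gap = np.abs(weights - saturated_attention(scores)).max()
    print(f"scale={c:>5}: weights={np.round(weights, 4)}, max gap={gap:.2e}")
```

In this toy setting the gap to the saturated weights shrinks as `c` increases, and the limiting weights split evenly between the two tied positions.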