Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training, or how one can minimise them. Our work focuses on these questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights for minimising OFs during training. As highlights, we emphasise the importance of controlling signal propagation throughout training, and propose the Outlier Protected transformer block, which removes standard Pre-Norm layers to mitigate OFs, without loss of convergence speed or training stability. Overall, our findings shed new light on our understanding of, our ability to prevent, and the complexity of this important facet of NN training dynamics.
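To make the kurtosis-based metric mentioned above concrete, the sketch below shows one plausible way to measure OFs from a layer's activations: compute a per-neuron activation norm across tokens, then take the kurtosis of those norms over the width dimension. This is a minimal illustration assuming a PyTorch tensor of shape (tokens, width); the function name `activation_kurtosis` and the exact normalisation are illustrative and may differ from the paper's precise formulation.

```python
import torch


def activation_kurtosis(acts: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Illustrative OF metric: kurtosis of per-neuron activation norms.

    acts: activation matrix of shape (num_tokens, width) from one layer.
    Returns a scalar. A value near 3 matches a Gaussian baseline; much
    larger values indicate a heavy-tailed distribution of neuron
    magnitudes, i.e. a few outlier features dominate the width.
    """
    # Per-neuron magnitude: root-mean-square activation over tokens.
    neuron_norms = acts.pow(2).mean(dim=0).sqrt()          # shape: (width,)

    # Standard (fourth standardised moment) kurtosis across neurons.
    centred = neuron_norms - neuron_norms.mean()
    variance = centred.pow(2).mean()
    kurtosis = centred.pow(4).mean() / (variance.pow(2) + eps)
    return kurtosis


# Example usage on random activations (Gaussian inputs give kurtosis ~ 3).
if __name__ == "__main__":
    acts = torch.randn(512, 1024)  # (tokens, width)
    print(activation_kurtosis(acts).item())
```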