Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus on the preferred normalization technique: some models employ LayerNorm while others use RMSNorm, especially among recent large language models. Converting a Transformer from one normalization type to the other is challenging. Despite this ongoing disagreement, we propose a solution that unifies the two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherently redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and the Pre-CRMSNorm Transformer, based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of the Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. This implies that Pre-LN Transformers can be replaced with their Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with a free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by 1% to 10%.
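The reduction described in the abstract can be illustrated with a small numerical sketch. It shows that LayerNorm of a vector matches RMSNorm of the same vector after its mean is removed, and that a zero-mean vector can be stored with one fewer entry because the dropped entry equals minus the sum of the rest. The function names (`layer_norm`, `rms_norm`, `compress`, `crms_norm`) are hypothetical, learnable gain and bias are omitted, and the `crms_norm` shown here is one reading of "lossless compression of the zero-mean vectors", not necessarily the exact definition used in the paper.

```python
import math

EPS = 1e-6  # assumed stabilizer; the abstract does not specify a value


def layer_norm(x, eps=EPS):
    """LayerNorm without learnable gain/bias: recenter, then rescale."""
    d = len(x)
    mean = sum(x) / d
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / d
    return [c / math.sqrt(var + eps) for c in centered]


def rms_norm(x, eps=EPS):
    """RMSNorm without learnable gain: rescale by the root mean square only."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [v / rms for v in x]


def compress(x_zero_mean):
    """Drop the last entry of a zero-mean vector (it is minus the sum of the others)."""
    return x_zero_mean[:-1]


def decompress(x_compressed):
    """Recover the full zero-mean vector from its compressed form."""
    return x_compressed + [-sum(x_compressed)]


def crms_norm(x_compressed, eps=EPS):
    """Rescale a compressed vector by the RMS of the full d-dimensional vector.

    Assumption: this is a sketch of the idea, not the paper's exact formula.
    """
    d = len(x_compressed) + 1
    last = -sum(x_compressed)  # the implicit, unstored entry
    mean_sq = (sum(v * v for v in x_compressed) + last * last) / d
    rms = math.sqrt(mean_sq + eps)
    return [v / rms for v in x_compressed]


x = [2.0, -1.0, 0.5, 4.5]
mean = sum(x) / len(x)
x_centered = [v - mean for v in x]          # zero-mean version of x

ln_out = layer_norm(x)                      # Pre-LN style normalization
rms_out = rms_norm(x_centered)              # RMSNorm on the zero-mean vector
crms_out = crms_norm(compress(x_centered))  # CRMSNorm on d-1 stored entries
```

In this sketch, `ln_out` and `rms_out` agree up to floating-point error, and `crms_out` agrees with the first `d - 1` entries of `ln_out`. The saving comes from skipping the mean computation and subtraction whenever the input is already known to have zero mean.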
