Transformers are difficult to optimize with stochastic gradient descent (SGD) and largely rely on adaptive optimizers such as Adam. Despite their empirical success, the reasons behind Adam's superior performance over SGD remain poorly understood. In this study, we analyze the optimization of Transformer models through the lens of \emph{gradient heterogeneity}, defined as the variation in gradient norms across parameter blocks. We provide a theoretical analysis showing that gradient heterogeneity, together with Hessian heterogeneity, degrades the convergence of gradient-based methods such as SGD, while sign-based methods are substantially less sensitive to this effect. Adam's coordinate-wise normalization makes its update directions depend mainly on gradient signs, so Adam can be interpreted as a soft variant of SignSGD. Our analysis uses the fact that SGD and SignSGD follow steepest descent directions under different norms, and derives upper bounds on the iteration complexity with implications for learning rate scaling in SignSGD. We further investigate the origin of gradient heterogeneity in Transformer architectures and show that it is strongly influenced by the placement of layer normalization, with Post-LN architectures exhibiting particularly pronounced heterogeneity. Experimental results from fine-tuning Transformers in both NLP and vision domains validate our theoretical analysis. Code is available at https://github.com/tom4649/gradient-heterogeneity.
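To make the abstract's central notions concrete, the following is a minimal sketch and not the authors' implementation. It computes per-block gradient norms, whose spread is what the abstract calls gradient heterogeneity, and compares an SGD update with a SignSGD update on the same gradients. The block names `block_a` and `block_b`, the gradient values, and all function names are hypothetical and chosen only for illustration. The last function uses a simplifying assumption: Adam with both moment decay rates set to zero, in which case its update reduces to `g / (|g| + eps)`, which is close to the sign of the gradient.

```python
import math


def block_grad_norms(grads):
    """Return the L2 norm of the gradient within each parameter block."""
    return {name: math.sqrt(sum(g * g for g in block))
            for name, block in grads.items()}


def sign(x):
    """Sign of a scalar: -1, 0, or 1."""
    return (x > 0) - (x < 0)


def sgd_step(grads, lr):
    """SGD update: proportional to the gradient, so small-gradient blocks barely move."""
    return {name: [-lr * g for g in block] for name, block in grads.items()}


def signsgd_step(grads, lr):
    """SignSGD update: every nonzero coordinate moves by exactly lr."""
    return {name: [-lr * sign(g) for g in block] for name, block in grads.items()}


def adam_step_no_momentum(grads, lr, eps=1e-8):
    """Adam with beta1 = beta2 = 0 (an illustrative special case).

    Then m = g and v = g^2, so the update is -lr * g / (|g| + eps),
    which approaches -lr * sign(g) as eps becomes negligible.
    """
    return {name: [-lr * g / (abs(g) + eps) for g in block]
            for name, block in grads.items()}


# Hypothetical gradients for two parameter blocks whose norms differ by about 1000x.
grads = {
    "block_a": [3.0, -4.0],
    "block_b": [0.003, -0.004],
}

norms = block_grad_norms(grads)
heterogeneity_ratio = max(norms.values()) / min(norms.values())

if __name__ == "__main__":
    print("per-block gradient norms:", norms)
    print("max/min norm ratio:", heterogeneity_ratio)
    print("SGD update:", sgd_step(grads, lr=0.1))
    print("SignSGD update:", signsgd_step(grads, lr=0.1))
    print("Adam (no momentum) update:", adam_step_no_momentum(grads, lr=0.1))
```

With a single learning rate, the SGD update for the small-norm block is roughly a thousand times smaller than for the large-norm block, whereas the sign-based updates have the same magnitude in both blocks. This is only a toy picture of the sensitivity the abstract describes; it says nothing about the convergence bounds themselves.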