SGD performs significantly worse than Adam on Transformers, but the reason remains unclear. In this work, we explain SGD's poor performance on Transformers through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) heterogeneity hampers SGD: SGD performs poorly on problems with block heterogeneity. To validate (ii), we examine various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD works well on problems without block heterogeneity but performs poorly when it is present. Our initial theoretical analysis indicates that SGD performs poorly because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated by using coordinate-wise learning rates, as designed in Adam.
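The learning-rate argument can be illustrated on a toy quadratic. The sketch below is an illustration under stated assumptions, not the experiments of this work: the diagonal Hessian, the three block scales, the step count, and the names `blocks`, `run_gd`, `single_lr`, and `blockwise_lr` are all hypothetical. It also uses deterministic gradient descent rather than SGD or Adam, and per-block learning rates rather than Adam's adaptive per-coordinate ones, to isolate the effect of sharing one learning rate across blocks.

```python
import numpy as np

# Hypothetical toy problem: f(x) = 0.5 * sum_i h_i * x_i**2 (diagonal Hessian).
# Three blocks whose curvature scales differ by orders of magnitude,
# an illustrative stand-in for "block heterogeneity".
blocks = [
    np.array([1.0, 2.0, 3.0]),
    np.array([100.0, 200.0, 300.0]),
    np.array([10000.0, 20000.0, 30000.0]),
]
h = np.concatenate(blocks)


def loss(x):
    return 0.5 * np.sum(h * x**2)


def run_gd(lr, steps=100):
    """Gradient descent from x = 1; lr is a scalar or a per-coordinate array."""
    x = np.ones_like(h)
    for _ in range(steps):
        x = x - lr * (h * x)  # the gradient of f is h * x
    return loss(x)


# One learning rate for all blocks, limited by the largest curvature overall.
single_lr = 1.0 / h.max()

# One learning rate per block, limited only by that block's largest curvature.
blockwise_lr = np.concatenate([np.full(b.shape, 1.0 / b.max()) for b in blocks])

loss_single = run_gd(single_lr)
loss_blockwise = run_gd(blockwise_lr)
print(f"single learning rate:     {loss_single:.3e}")
print(f"block-wise learning rate: {loss_blockwise:.3e}")
```

With these made-up curvatures, the single learning rate is dictated by the sharpest block, so the flattest block barely moves in 100 steps, whereas the block-wise run drives the loss to essentially zero in the same number of steps.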