Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product, which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterparts' performance on transformers. The second-order perspective also has practical benefits for the development of adaptive methods with non-diagonal preconditioners. Unlike root-based counterparts such as Shampoo, root-free non-diagonal methods do not require numerically unstable matrix square roots and therefore work well in low precision, which we demonstrate empirically. These findings raise important questions about the currently overlooked role of adaptivity in the success of adaptive methods, since that success is often attributed to the sign descent induced by the root.
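To make the role of the square root concrete, the following is a minimal sketch and not the algorithm from this work: a generic RMSProp-style diagonal update in plain Python, where a hypothetical `use_root` flag toggles the root. The function name, the hyperparameter defaults, the zero initialization of the second-moment estimate, and the placement of `eps` are all assumptions made for illustration; the root-free methods studied here may differ in scaling, initialization, and other details.

```python
import math


def adaptive_step(params, grads, second_moment, lr=1e-3, beta2=0.999,
                  eps=1e-8, use_root=True):
    """One RMSProp-style step on flat lists of floats (illustrative only).

    second_moment holds an exponential moving average of squared gradients,
    i.e. the diagonal of the gradient outer product.
    use_root=True  divides by sqrt(v) + eps (Adam/RMSProp-like).
    use_root=False divides by v + eps (root-free variant, assumed form).
    """
    new_params, new_moment = [], []
    for p, g, v in zip(params, grads, second_moment):
        # Update the running estimate of the squared gradient.
        v = beta2 * v + (1.0 - beta2) * g * g
        # Apply the diagonal preconditioner with or without the root.
        denom = math.sqrt(v) + eps if use_root else v + eps
        new_params.append(p - lr * g / denom)
        new_moment.append(v)
    return new_params, new_moment


def first_step_size(grad, use_root):
    """Magnitude of the very first step from p = 0 with v initialized to 0."""
    new_params, _ = adaptive_step([0.0], [grad], [0.0], use_root=use_root)
    return abs(new_params[0])


# Compare how the first step reacts when the gradient is scaled by 10.
root_small = first_step_size(1.0, use_root=True)
root_large = first_step_size(10.0, use_root=True)
free_small = first_step_size(1.0, use_root=False)
free_large = first_step_size(10.0, use_root=False)
```

In this toy setting the root-based step is almost independent of the gradient magnitude, which is the sign-descent-like behavior the abstract refers to, whereas the root-free step changes with the gradient scale. This only illustrates the mechanical difference between the two updates and says nothing about generalization or low-precision behavior.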