Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product, which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterparts' performance on transformers. The second-order perspective also has practical benefits for the development of non-diagonal adaptive methods through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, the root-free counterparts do not require numerically unstable matrix root decompositions and inversions, and thus work well in half precision. Our findings provide new insights into the development of adaptive methods and raise important questions regarding the currently overlooked role of adaptivity for their success.
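To make the role of the square root concrete, the following is a minimal sketch of a diagonal adaptive step with and without the root. It is a simplified RMSProp-style update written for illustration, not the algorithm proposed in this work: the function name `adaptive_step`, the flag `use_root`, and the default hyperparameters are assumptions, and momentum, bias correction, and any rescaling of the preconditioner are omitted.

```python
import math


def adaptive_step(params, grads, state, lr=0.1, beta2=0.9, eps=1e-8, use_root=True):
    """One diagonal adaptive step (illustrative; no momentum or bias correction).

    params, grads, state are equal-length lists of floats. `state` holds the
    running average of squared gradients, i.e. the diagonal of the gradient
    outer product.
    """
    new_params, new_state = [], []
    for p, g, v in zip(params, grads, state):
        # Exponential moving average of the squared gradient.
        v = beta2 * v + (1.0 - beta2) * g * g
        # Root-based methods divide by sqrt(v); the root-free variant divides by v.
        denom = (math.sqrt(v) if use_root else v) + eps
        new_params.append(p - lr * g / denom)
        new_state.append(v)
    return new_params, new_state


# Same gradient, two preconditioners.
rooted, _ = adaptive_step([1.0], [2.0], [0.0], use_root=True)
root_free, _ = adaptive_step([1.0], [2.0], [0.0], use_root=False)
print(rooted, root_free)
```

In this sketch, multiplying the gradient by a constant leaves the root-based step essentially unchanged, whereas the root-free step shrinks by that constant (ignoring `eps`). This is one simple way to see that removing the root changes the character of the update rather than merely its magnitude.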