Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
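As a minimal illustration of the claim for Adam (a sketch for intuition, not the paper's implementation): with exponential moving averages switched off ($\beta_1 = \beta_2 = 0$), the Adam update reduces to $g / \sqrt{g^2} = \mathrm{sign}(g)$, which is exactly the steepest-descent direction under the $\ell_\infty$ (max) norm, since that norm permits every coordinate to move by the same amount.

```python
import numpy as np

def adam_step_no_ema(g):
    # Adam with beta1 = beta2 = 0: first moment m = g, second moment v = g**2,
    # so the step direction m / sqrt(v) collapses to sign(g) (for nonzero g).
    return g / np.sqrt(g * g)

def steepest_descent_linf(g):
    # Steepest descent under the l-infinity norm moves every coordinate
    # by the same magnitude, in the direction that decreases the loss:
    # the update direction is sign(g).
    return np.sign(g)

g = np.array([0.3, -2.0, 0.05])
print(adam_step_no_ema(g))      # → [ 1. -1.  1.]
print(steepest_descent_linf(g)) # → [ 1. -1.  1.]
```

The two directions coincide coordinate-wise, which is the sense in which EMA-free Adam is a first-order steepest-descent method rather than an approximate second-order one.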