In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when scales vary significantly across coordinates, we show that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements (momentum and discounting factors) as well as on model EMA, motivating their wide use in practice.
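To make the objects of the analysis concrete, the sketch below shows one step of Adam with elementwise update clipping followed by a model EMA update. This is only an illustrative PyTorch rendering under assumed choices, not the paper's exact algorithm: the clipping threshold `tau`, the EMA decay `gamma`, and the elementwise form of the clipping are hypothetical placeholders.

```python
import torch

def clipped_adam_ema_step(param, ema_param, grad, m, v, t,
                          lr=1e-3, beta1=0.9, beta2=0.999,
                          eps=1e-8, tau=1.0, gamma=0.999):
    """One illustrative step: clipped Adam update + model EMA.

    `tau` (clipping threshold) and `gamma` (EMA decay) are assumed
    hyperparameters for this sketch, not values from the paper.
    """
    # First- and second-moment estimates with discounting factors
    # beta1, beta2 (Adam's momentum and adaptivity components).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Standard bias correction at step t (t >= 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Coordinate-wise adaptive update, clipped elementwise to [-tau, tau].
    update = (m_hat / (v_hat.sqrt() + eps)).clamp_(-tau, tau)
    param.add_(update, alpha=-lr)

    # Model EMA: a slowly moving average of the iterates, which the
    # analysis studies in place of the last iterate.
    ema_param.mul_(gamma).add_(param, alpha=1 - gamma)
```

In this rendering, the coordinate-wise adaptivity discussed in the abstract corresponds to the elementwise division by `v_hat.sqrt() + eps`, and the EMA weights `ema_param` are the averaged model that the convergence guarantees concern.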