Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms such as Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms over the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw connections between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
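To make the core idea concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: a STORM-style scaled recursive-momentum gradient estimator combined with an Adam-like preconditioned update, minimizing a noisy 1-D quadratic. All hyperparameter names and values here (`lr`, `beta1`, `beta2`, `gamma`) are illustrative assumptions, and the full method's additional components (e.g., gradient clipping, weight decay) are omitted.

```python
import random
import math

def noisy_grad(x, noise):
    # Stochastic gradient of f(x) = 0.5 * (x - 3)^2 with additive sampled noise.
    return (x - 3.0) + noise

def mars_style_step(x, m, v, g_now, g_prev_same_sample, t,
                    lr=0.1, beta1=0.9, beta2=0.999, gamma=0.05, eps=1e-8):
    # Variance-reduced "corrected" gradient: the scaled difference of gradients
    # evaluated at consecutive iterates on the SAME sample reduces the variance
    # of the momentum estimate (STORM-style recursive momentum, scaled by gamma).
    c = g_now + gamma * (beta1 / (1.0 - beta1)) * (g_now - g_prev_same_sample)
    m = beta1 * m + (1.0 - beta1) * c          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * c * c      # second moment (preconditioner)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, as in Adam
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

random.seed(0)
x, m, v = 0.0, 0.0, 0.0
x_prev = x
for t in range(1, 201):
    noise = random.gauss(0.0, 0.5)             # one sample shared by both points
    g_now = noisy_grad(x, noise)
    g_prev = noisy_grad(x_prev, noise)
    x_prev = x
    x, m, v = mars_style_step(x, m, v, g_now, g_prev, t)
print(x)  # converges near the minimizer x* = 3
```

The key structural point is that the difference term `g_now - g_prev_same_sample` is computed on a single shared sample, so it is deterministic given the iterates; this is what lets the recursive momentum estimator cancel gradient noise rather than accumulate it.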