Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps. They help models converge faster, and often to lower minima: e.g., a $1.3$B-parameter AdEMAMix LLM trained on $101$B tokens performs comparably to an AdamW model trained on $197$B tokens ($+95\%$). Moreover, our method significantly slows down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
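To make the "mixture of two EMAs" idea concrete, below is a minimal, hypothetical single-parameter sketch of the update the abstract describes: a fast EMA (the usual Adam momentum with decay $\beta_1$) is combined with a slow EMA (decay $\beta_3$ close to $1$) weighted by a mixing coefficient $\alpha$, and the sum is normalized by Adam's second-moment estimate. The parameter names (`b3`, `alpha`) and the choice to bias-correct only the fast EMA and the second moment are assumptions for illustration, not a definitive reproduction of the paper's algorithm (which also schedules $\alpha$ and $\beta_3$ during warmup).

```python
import math

def ademamix_step(p, g, m1, m2, v, t, lr=1e-3,
                  b1=0.9, b2=0.999, b3=0.9999, alpha=5.0, eps=1e-8):
    """One sketched AdEMAMix-style update for a single scalar parameter.

    Hypothetical minimal version: in practice this would be applied
    elementwise over tensors, with schedulers for alpha and b3.
    """
    m1 = b1 * m1 + (1 - b1) * g        # fast EMA, as in Adam
    m2 = b3 * m2 + (1 - b3) * g        # slow EMA, retains very old gradients
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m1_hat = m1 / (1 - b1 ** t)        # bias-correct fast EMA (slow EMA left as-is)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m1_hat + alpha * m2) / (math.sqrt(v_hat) + eps)
    return p, m1, m2, v
```

A few iterations on a toy quadratic loss (gradient $2p$) show the iterate shrinking toward the minimum, with the slow EMA's contribution growing as old gradients accumulate.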