Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they maintain additional moving-average states throughout training, resulting in memory requirements several times greater than the model itself. This overhead constrains scalability and computational efficiency. On the other hand, while stochastic gradient descent (SGD) is optimal in terms of memory efficiency, its capability in LLM training is limited (Zhao et al., 2024b). To address this dilemma, we show that preprocessing the stochastic gradients of SGD is sufficient to reach Adam-level performance on LLMs. Specifically, we propose to preprocess the instantaneous stochastic gradients with two simple operators: $\mathtt{GradNorm}$ and $\mathtt{GradWhitening}$. $\mathtt{GradNorm}$ stabilizes the gradient distribution, while $\mathtt{GradWhitening}$ counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any accumulative state variables. Empirically, SWAN has the same memory footprint as SGD, achieving an approximately 50% reduction in total end-to-end memory compared to Adam. On language modeling tasks, SWAN performs on par with or substantially better than Adam. Specifically, when pre-training LLaMa models with 350M and 1.3B parameters, SWAN achieves a 2x speedup, reaching the same evaluation perplexity with fewer than half the tokens seen.
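The abstract does not spell out the two operators, so the sketch below is only a minimal, hypothetical reading of a stateless SWAN-style update for a 2-D weight matrix: it assumes $\mathtt{GradNorm}$ performs a per-row standardization of the gradient and $\mathtt{GradWhitening}$ left-multiplies by $(GG^\top)^{-1/2}$. The function names, the `eps` regularizer, and the eigendecomposition-based inverse square root are illustrative choices, not the paper's implementation.

```python
import torch

def grad_norm(g, eps=1e-8):
    # Assumed GradNorm: standardize each row of the gradient to zero mean, unit variance.
    mean = g.mean(dim=1, keepdim=True)
    std = g.std(dim=1, keepdim=True)
    return (g - mean) / (std + eps)

def grad_whitening(g, eps=1e-8):
    # Assumed GradWhitening: left-multiply by (G G^T + eps I)^{-1/2}.
    gg_t = g @ g.T
    eye = torch.eye(g.shape[0], device=g.device, dtype=g.dtype)
    # Inverse matrix square root via eigendecomposition of a symmetric PSD matrix.
    eigvals, eigvecs = torch.linalg.eigh(gg_t + eps * eye)
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp(min=eps).rsqrt()) @ eigvecs.T
    return inv_sqrt @ g

def swan_step(param, lr=1e-3):
    # One stateless SWAN-style update: no momentum or second-moment buffers are kept.
    with torch.no_grad():
        update = grad_whitening(grad_norm(param.grad))
        param.add_(update, alpha=-lr)
```

Because the update depends only on the current gradient, the optimizer's memory footprint matches plain SGD, which is the source of the reported end-to-end memory savings relative to Adam.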