Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in training large language models. However, adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which incurs substantial extra memory overhead. To address this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed that drastically reduce auxiliary memory usage, but at the cost of a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training with a large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy than the Adam optimizer. The implementation of CAME is publicly available.
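To make the memory trade-off concrete, the following is a minimal sketch that compares the auxiliary state kept by an Adam-style optimizer with an Adafactor-style factored second moment for a single weight matrix. The function names and the layer shape are hypothetical and chosen only for illustration. The sketch does not implement CAME or its confidence-guided strategy, and it omits the exponential moving averages that a real optimizer would apply.

```python
def adam_state_size(rows, cols):
    """Auxiliary floats an Adam-style optimizer keeps for a rows x cols
    weight matrix: one first-moment and one second-moment entry per parameter."""
    return 2 * rows * cols


def factored_second_moment_size(rows, cols):
    """Auxiliary floats for an Adafactor-style factored second moment:
    one accumulator per row plus one per column."""
    return rows + cols


def rank1_estimate(sq_grads):
    """Rebuild a full second-moment estimate from row and column sums.

    Entry (i, j) is row_sum[i] * col_sum[j] / total, which is exact when
    the input matrix has rank 1. Assumes the entries sum to a positive value.
    """
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


if __name__ == "__main__":
    rows, cols = 1024, 4096  # hypothetical layer shape, for illustration only
    print("Adam state entries:    ", adam_state_size(rows, cols))
    print("Factored state entries:", factored_second_moment_size(rows, cols))
    print(rank1_estimate([[1.0, 2.0], [2.0, 4.0]]))
```

For the hypothetical 1024 x 4096 layer, the factored second moment needs 5,120 entries instead of the roughly 4.2 million a full second-moment matrix would require. The price is that the rank-1 reconstruction is only an approximation for general gradients, which is the source of the performance penalty mentioned above.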