Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which incurs substantial extra memory overhead. To address this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed that drastically reduce auxiliary memory usage, but at the cost of a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training with a large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy than the Adam optimizer. The implementation of CAME is publicly available.
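To make the memory trade-off mentioned in the abstract concrete, the following is a minimal sketch of the Adafactor-style idea of storing only row and column statistics of the squared gradients instead of one value per parameter. It is not the CAME algorithm and not taken from the released implementation; the function names `second_moment_slots` and `factored_estimate` are hypothetical, and the sketch ignores moving averages, first-moment state, and the confidence-guided step.

```python
def second_moment_slots(n_rows, n_cols, factored):
    """Floats kept for the second-moment state of one n_rows x n_cols weight matrix.

    Assumption: a full (Adam-style) estimate stores one value per parameter,
    while a factored (Adafactor-style) estimate stores one value per row
    plus one value per column.
    """
    return n_rows + n_cols if factored else n_rows * n_cols


def factored_estimate(sq_grads):
    """Rank-1 reconstruction of a squared-gradient matrix from its row and column sums."""
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    total = sum(row_sums)
    # Entry (i, j) is approximated by row_sums[i] * col_sums[j] / total.
    return [[r * c / total for c in col_sums] for r in row_sums]


if __name__ == "__main__":
    # Illustrative shape only; not a layer size reported in the paper.
    print(second_moment_slots(1024, 4096, factored=False))
    print(second_moment_slots(1024, 4096, factored=True))
    print(factored_estimate([[1.0, 2.0], [2.0, 4.0]]))
```

The reconstruction is exact only when the squared-gradient matrix happens to be rank 1; otherwise it is an approximation, which is one plausible source of the performance penalty and instability that the abstract attributes to existing memory-efficient optimizers.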