We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MicroAdam that specifically minimizes memory overhead while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby significantly reducing its memory footprint. We control the resulting compression error via a novel instance of the classical error feedback mechanism from distributed optimization [Seide et al., 2014, Alistarh et al., 2018, Karimireddy et al., 2019], in which the error correction information is itself compressed to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance. Specifically, we show that MicroAdam can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive with that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.
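To make the mechanism concrete, the following is a minimal sketch of the core idea: the dense gradient (plus the accumulated error) is sparsified before reaching the optimizer state, and the resulting compression error is stored in a quantized buffer rather than at full precision. The helper names, the top-k sparsifier, and the uniform 4-bit scheme here are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def compress_topk(x, k):
    """Keep the k largest-magnitude entries of x; zero the rest."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def quantize_4bit(x):
    """Uniform symmetric 4-bit quantization (illustrative): map to integer levels in [-8, 7]."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

def compressed_gradient_step(grad, err_q, err_scale, k):
    """One error-feedback step with a compressed error buffer (sketch).

    grad:      dense gradient for this step
    err_q,
    err_scale: quantized error buffer carried over from the previous step
    k:         number of gradient entries to keep

    Returns the sparse gradient to feed into the optimizer state,
    plus the new (quantized) error buffer.
    """
    acc = grad + dequantize(err_q, err_scale)  # add back previous compression error
    sparse_grad = compress_topk(acc, k)        # compressed gradient -> optimizer state
    new_err = acc - sparse_grad                # what the compression dropped
    err_q, err_scale = quantize_4bit(new_err)  # the error buffer is itself compressed
    return sparse_grad, err_q, err_scale
```

The key memory saving in this sketch comes from two places: the optimizer state only ever sees a k-sparse gradient, and the error feedback buffer is held at low precision instead of float32.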