We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overhead while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby significantly reducing its memory footprint. We control the resulting compression error via a novel instance of the classical \emph{error feedback} mechanism from distributed optimization, in which \emph{the error correction information is itself compressed} to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance. Specifically, we show that MicroAdam can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive with that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.
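The core idea — compressing gradients before they enter the optimizer state, with an error-feedback accumulator that carries the compression residual to the next step — can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which additionally compresses the error accumulator itself and integrates with Adam's moment estimates); it only shows generic Top-K sparsification with error feedback, and all names here are hypothetical:

```python
import numpy as np

def topk_compress(x, k):
    # Keep the k largest-magnitude entries of x; zero out the rest.
    # A sparse representation of the result is what saves memory in practice.
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def compress_with_error_feedback(grad, error, k):
    # Fold the accumulated compression error back into the gradient,
    # compress, and carry the new residual forward. Over time, no
    # gradient component is permanently discarded.
    corrected = grad + error
    compressed = topk_compress(corrected, k)
    new_error = corrected - compressed
    return compressed, new_error
```

In MicroAdam, the `error` buffer would itself be stored in compressed form — that is the paper's key twist for achieving real memory savings, which this sketch omits.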