Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimizer states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
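To make the core idea concrete, the following is a minimal NumPy sketch of an APOLLO-style step as described above, not the authors' implementation. All names (`apollo_step`, the `state` dict layout) and the specific channel-wise scaling formula are illustrative assumptions: the sketch keeps AdamW-style moments only for a rank-r random projection of the gradient, derives a per-channel learning-rate scale from the adapted low-rank update, and applies that structured scale to the raw full gradient.

```python
import numpy as np

def apollo_step(W, grad, state, lr=1e-3, rank=1, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative APOLLO-style update (hypothetical sketch, not the paper's code).

    Full AdamW would keep two (n x m) moment tensors for `grad`. Here we keep
    moments only for a (rank x m) random projection of it, and use the adapted
    low-rank update to derive a structured, channel-wise learning-rate scaling.
    """
    n, m = grad.shape
    if "P" not in state:
        rng = np.random.default_rng(0)
        # Pure random projection (no SVD), fixed once at initialization.
        state["P"] = rng.standard_normal((rank, n)) / np.sqrt(rank)
        state["m"] = np.zeros((rank, m))  # first moment, low-rank space
        state["v"] = np.zeros((rank, m))  # second moment, low-rank space
        state["t"] = 0
    state["t"] += 1

    R = state["P"] @ grad  # projected gradient, shape (rank, m)
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * R
    state["v"] = b2 * state["v"] + (1 - b2) * R ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    update = m_hat / (np.sqrt(v_hat) + eps)  # AdamW-style update in low-rank space

    # Channel-wise scale: how much the adaptive rule amplifies each column,
    # measured in the low-rank space and applied to the full gradient.
    scale = np.linalg.norm(update, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    return W - lr * grad * scale
```

With `rank=1` (the APOLLO-Mini regime in the abstract), the optimizer state shrinks from two (n x m) tensors to two (1 x m) vectors plus the fixed projection, which is what makes the abstract's "SGD-level memory" claim plausible for this family of updates.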