Running out of GPU memory has become a major bottleneck for large-scale DNN training, and reducing the memory footprint during training has received intensive research attention. We find that conventional gradient accumulation reduces activation memory but is incompatible with gradient memory reduction, because accumulation requires preserving gradients across micro-batches whereas gradient memory reduction requires releasing them. To address this issue, we propose a novel optimizer accumulation method for Adam, named Adam Accumulation (AdamA), which reduces both activation and gradient memory. Specifically, AdamA integrates gradients directly into the optimizer states and accumulates the optimizer states over micro-batches, so that gradients can be released immediately after use. We demonstrate mathematically and experimentally that AdamA yields the same convergence properties as Adam. Evaluated on transformer-based models, AdamA achieves up to 23% memory reduction compared to gradient accumulation, with less than 2% degradation in training throughput. Notably, AdamA can be combined with memory reduction methods for optimizer states to fit models 1.26x to 3.14x larger than the PyTorch and DeepSpeed baselines on GPUs with different memory capacities.
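The idea of accumulating optimizer states rather than gradients can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `adama_step`, the `grad_fn` callback, and the list-of-floats parameter layout are hypothetical, and the scaling of the second-moment term (the mean of squared micro-batch gradients) is an assumption made for this sketch. Under that assumption the first moment matches Adam's exactly, while the second moment matches Adam's only when the micro-batch gradients agree.

```python
import math


def adama_step(params, grad_fn, micro_batches, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update that folds each micro-batch gradient into m and v.

    params, m, v are lists of floats, updated in place.
    grad_fn(params, batch) returns a list of gradients (hypothetical callback).
    step is the 1-based update count used for bias correction.
    """
    n = len(micro_batches)

    # Decay the optimizer states once per update, before any micro-batch.
    for j in range(len(params)):
        m[j] *= beta1
        v[j] *= beta2

    for batch in micro_batches:
        g = grad_fn(params, batch)
        for j, gj in enumerate(g):
            # First moment: linear in g, so the sum equals Adam's update
            # with the averaged gradient.
            m[j] += (1.0 - beta1) * gj / n
            # Second moment: mean of squared micro-batch gradients
            # (an assumption of this sketch, not a confirmed formula).
            v[j] += (1.0 - beta2) * gj * gj / n
        # Drop the reference so the gradient buffer can be freed
        # before the next micro-batch is processed.
        del g

    # Standard Adam bias correction and parameter update.
    for j in range(len(params)):
        m_hat = m[j] / (1.0 - beta1 ** step)
        v_hat = v[j] / (1.0 - beta2 ** step)
        params[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

In this sketch no buffer ever holds an accumulated gradient: only `m` and `v` persist across micro-batches, which is the property that lets gradient memory be released immediately after use.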