In current deep learning tasks, Adam-style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion are widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of the gradient, which yields more stable convergence curves. The learning rate and the batch size are the most critical optimizer hyperparameters, and both require careful tuning for effective convergence. Previous research has shown that, for SGD-style optimizers, the optimal learning rate increases linearly with the batch size or follows a similar rule. However, this conclusion does not apply to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we establish the scaling law between batch size and optimal learning rate in the sign-of-gradient case, and we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of this surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.
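The rise-then-fall behavior described in the abstract can be illustrated with a small toy calculation. The sketch below is not the paper's derivation or its scaling law. It assumes a second-order Taylor model of the loss, a pure sign-of-gradient update, independent Gaussian gradient noise per coordinate with variance `noise_std**2 / batch_size`, and a hypothetical Hessian of the form `a*I + b*ones`. All names and constants (`optimal_lr`, `dim`, `grad`, `noise_std`, `a`, `b`) are illustrative assumptions. Under these assumptions the step size that minimizes the expected loss change has a closed form, and it can be evaluated across batch sizes.

```python
import math

def expected_sign(grad, noise_std, batch_size):
    """E[sign(g_hat)] for g_hat ~ Normal(grad, noise_std**2 / batch_size)."""
    return math.erf(grad * math.sqrt(batch_size) / (noise_std * math.sqrt(2.0)))

def optimal_lr(batch_size, dim=101, grad=1.0, noise_std=100.0, a=1.0, b=1.0):
    """Step size minimizing the expected second-order loss change of a sign update.

    Toy assumptions (illustrative only, not taken from the paper):
      - every coordinate has the same true gradient `grad`
      - the Hessian is a*I + b*ones((dim, dim))
      - gradient noise is independent across coordinates
    Expected change: -lr * sum_i g_i E[s_i] + 0.5 * lr**2 * E[s^T H s].
    """
    m = expected_sign(grad, noise_std, batch_size)
    linear_gain = dim * grad * m                              # sum_i g_i * E[s_i]
    # Diagonal terms use s_i**2 == 1; off-diagonal terms use E[s_i] * E[s_j].
    curvature = dim * (a + b) + dim * (dim - 1) * b * m * m   # E[s^T H s]
    return linear_gain / curvature

batch_sizes = [2 ** k for k in range(17)]        # 1, 2, 4, ..., 65536
lrs = [optimal_lr(B) for B in batch_sizes]
peak = max(range(len(lrs)), key=lrs.__getitem__)

for B, lr in zip(batch_sizes, lrs):
    print(f"B={B:6d}  lr*={lr:.5f}")
print("peak batch size:", batch_sizes[peak])
```

In this toy model the optimal step size grows while the sign estimate is still noisy, then shrinks once the signs become reliable and the off-diagonal curvature terms dominate. Shrinking `grad` relative to `noise_std` moves the peak to larger batch sizes, which is at least consistent with the abstract's claim about the peak shifting as training progresses.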