Adaptive gradient algorithms have been widely adopted for training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) are not fully understood, especially in the large batch-size setting commonly used in practice. This is largely because the only theoretical result demonstrating a benefit of Adagrad over SGD was obtained in the original Adagrad paper, for nonsmooth objective functions. However, for nonsmooth objectives, convergence can slow down linearly as the batch size increases, so a convergence analysis based on the nonsmoothness assumption cannot be applied to large-batch algorithms. In this work, we close this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives that is suitable for the large-batch setting. We show that under anisotropic smoothness and noise conditions, increasing the batch size does not slow down the convergence of Adagrad, so it retains a faster convergence guarantee than SGD even with large batches. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments on logistic regression and instruction-following fine-tuning tasks provide strong evidence supporting our theoretical analysis.
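To make the comparison concrete, the following is a minimal illustrative sketch of the standard (diagonal) Adagrad update on a toy anisotropic quadratic, the kind of ill-conditioned objective where per-coordinate adaptive step sizes help relative to SGD's single learning rate. The objective, step size, and curvature values here are hypothetical and chosen only for illustration; this is not the paper's exact large-batch setting or analysis.

```python
import numpy as np

def adagrad_step(x, g, accum, lr=0.1, eps=1e-8):
    """One diagonal Adagrad update: each coordinate's effective step
    size shrinks with its own accumulated squared gradients."""
    accum += g ** 2                        # running sum of squared gradients
    x -= lr * g / (np.sqrt(accum) + eps)   # coordinate-wise adaptive step
    return x, accum

# Toy anisotropic quadratic f(x) = 0.5 * x^T diag(c) x, whose curvature
# differs by 100x across coordinates (illustrative values).
c = np.array([100.0, 1.0])
x = np.array([1.0, 1.0])
accum = np.zeros_like(x)
for _ in range(200):
    g = c * x                              # exact gradient of the quadratic
    x, accum = adagrad_step(x, g, accum)

final_loss = float(0.5 * np.dot(c, x ** 2))
print(final_loss)                          # far below the initial loss of 50.5
```

A single SGD learning rate on this objective must be small enough for the stiff coordinate (curvature 100), which makes progress on the flat coordinate (curvature 1) slow; Adagrad's per-coordinate normalization sidesteps that trade-off, which is the intuition behind the anisotropic conditions in the analysis.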