Adaptive gradient optimizers such as AdaGrad, which dynamically adjust the learning rate based on past gradients, have emerged as powerful tools in deep learning. These adaptive methods have achieved significant success in a variety of deep learning tasks, often outperforming stochastic gradient descent (SGD). However, despite AdaGrad's status as a cornerstone of adaptive optimization, its theoretical analysis has not adequately addressed key aspects such as asymptotic convergence and non-asymptotic convergence rates in non-convex optimization. This study provides a comprehensive analysis of AdaGrad and bridges these gaps in the literature. We introduce a novel stopping-time technique from probability theory, which allows us to establish the stability of AdaGrad under mild conditions. We further derive the asymptotic almost-sure and mean-square convergence of AdaGrad. In addition, we establish a near-optimal non-asymptotic convergence rate measured by the average squared gradient in expectation, which is stronger than existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
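For concreteness, the standard coordinate-wise AdaGrad recursion and the convergence measure referred to above can be sketched as follows; the symbols $\eta$, $\epsilon$, $g_t$, and $v_t$ are generic notation for illustration and may differ from the paper's exact formulation:
\[
v_t = v_{t-1} + g_t \odot g_t, \qquad
x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t,
\]
where $g_t$ is a stochastic gradient of the objective $f$ at the iterate $x_t$, $\eta > 0$ is a base step size, and $\epsilon > 0$ is a small stability constant. The non-asymptotic rate is measured by the expected average of squared gradient norms, $\tfrac{1}{T}\sum_{t=1}^{T}\mathbb{E}\,\|\nabla f(x_t)\|^2$.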