Adaptive optimizers have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on the iteratively computed gradients. These adaptive methods have achieved significant success in various deep learning tasks, often outperforming stochastic gradient descent (SGD). However, despite AdaGrad's status as a cornerstone of adaptive optimization, its theoretical analysis has not adequately addressed key aspects such as asymptotic convergence and non-asymptotic convergence rates in non-convex optimization. This study provides a comprehensive analysis of AdaGrad that fills these gaps in the literature. We introduce a novel stopping-time technique from probability theory, which allows us to establish the stability of AdaGrad under mild conditions for the first time. We further derive the asymptotic almost-sure and mean-square convergence of AdaGrad. In addition, we establish a near-optimal non-asymptotic convergence rate measured by the average squared gradient in expectation, which is stronger than the existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
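For illustration, the AdaGrad iteration referred to above can be sketched in its standard coordinate-wise form; the notation here is ours and may differ from the paper's ($g_t$ denotes the stochastic gradient at step $t$, $v_t$ the running sum of squared gradients, $\eta$ the base step size, $\epsilon > 0$ a small stabilizing constant, and $\odot$ the element-wise product):
\[
v_t = v_{t-1} + g_t \odot g_t, \qquad
x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t .
\]
Because $v_t$ is non-decreasing, the effective step size $\eta/(\sqrt{v_t}+\epsilon)$ shrinks along coordinates with large accumulated gradients, which is the adaptive behavior whose stability and convergence the paper analyzes.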