Adaptive optimizers have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on the gradients observed along the iterates. These adaptive methods have achieved significant success across a variety of deep learning tasks, often outperforming stochastic gradient descent (SGD). However, although AdaGrad is a cornerstone adaptive optimizer, its theoretical analysis remains incomplete with respect to asymptotic convergence and non-asymptotic convergence rates in non-convex optimization. This study provides a comprehensive analysis and a complete picture of AdaGrad. We first introduce a novel stopping-time technique from probability theory to establish stability of the norm version of AdaGrad under milder conditions. We further derive two forms of asymptotic convergence: almost-sure convergence and mean-square convergence. Furthermore, under mild assumptions, we establish a near-optimal non-asymptotic convergence rate measured by the average squared gradient in expectation, a guarantee that is rarely explored and is stronger than existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
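For concreteness, the following is a minimal sketch of the norm version of AdaGrad referred to above, assuming the standard formulation with base stepsize $\eta > 0$, initial accumulator $v_0 > 0$, iterates $x_t$, and stochastic gradients $g_t$; this notation is illustrative and may differ from that used in the body of the paper:
\[
v_t = v_{t-1} + \|g_t\|^2, \qquad x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t}}\, g_t,
\]
with the non-asymptotic guarantee stated in terms of the average squared gradient in expectation, $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x_t)\|^2\big]$.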