Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution, which may result in an output worse than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al. [2022]), this approach results in slower convergence rates for convex and over-parameterized models. In this work, we make two contributions. First, we propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions, respectively, when training over-parameterized models. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound on the optimal function value as input. Second, we equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\widetilde{\mathcal{O}}(n+1/\epsilon)$ gradient evaluations to achieve $\mathcal{O}(\epsilon)$-suboptimality for convex functions, which improves upon the slower $\mathcal{O}(1/\epsilon^2)$ rates of AdaSPS and AdaSLS without variance reduction in the non-interpolation regime. Moreover, our result matches the fast rates of AdaSVRG while removing the inner-outer-loop structure, making the algorithms easier to implement and analyze. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.
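
To make the baseline concrete, the following is a minimal sketch of SGD with a capped stochastic Polyak stepsize, in the commonly stated form step = min((f_i(x) - l_i) / (c * ||grad f_i(x)||^2), gamma_max), where l_i is a lower bound on the minimum of f_i. It illustrates the SPS baseline that the abstract refers to, not the AdaSPS or AdaSLS rules proposed in this work. The function name `sps_max_sgd`, the parameters `c` and `gamma_max`, and the least-squares test problem are assumptions chosen for illustration; the lower bound l_i = 0 is valid here because each per-sample loss is a squared residual.

```python
import numpy as np

def sps_max_sgd(A, b, x0, c=0.5, gamma_max=1.0, n_steps=2000, seed=0):
    """SGD on f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 with a capped
    stochastic Polyak stepsize (illustrative sketch, not AdaSPS)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = A.shape[0]
    for _ in range(n_steps):
        i = rng.integers(n)               # sample one data point uniformly
        residual = A[i] @ x - b[i]
        loss_i = 0.5 * residual**2        # f_i(x); lower bound l_i = 0 is assumed
        grad_i = residual * A[i]          # gradient of f_i at x
        grad_sq = grad_i @ grad_i
        if grad_sq == 0.0:                # sample already fitted: no update
            continue
        # Polyak stepsize for the sampled loss, capped at gamma_max
        step = min(loss_i / (c * grad_sq), gamma_max)
        x -= step * grad_i
    return x

# Interpolation setting: b = A @ x_star, so every f_i attains 0 at x_star.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_star = rng.standard_normal(5)
b = A @ x_star
x_hat = sps_max_sgd(A, b, x0=np.zeros(5))
print(np.linalg.norm(x_hat - x_star))    # distance to the interpolating solution
```

In this interpolation example every sample can be fitted exactly, which is the regime where SPS is reported to work well. If noise is added to `b`, the same rule no longer drives the stepsize to zero, which is the neighborhood-convergence issue that motivates the variants described above.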
