The Polyak stepsize has proven to be a fundamental stepsize in convex optimization, giving near-optimal gradient descent rates across a wide range of assumptions. The universality of the Polyak stepsize has also inspired many stochastic variants, with theoretical guarantees and strong empirical performance. Despite the many theoretical results, our understanding of the convergence properties and shortcomings of the Polyak stepsize and its variants is both incomplete and fractured across different analyses. We propose a new, unified, and simple perspective on the Polyak stepsize and its variants as gradient descent on a surrogate loss. We show that each variant is equivalent to minimizing a surrogate function with a stepsize that adapts to a guaranteed local curvature. This general surrogate-loss perspective is then used to provide a unified analysis of existing variants across different assumptions. Moreover, we prove a number of negative results showing that the non-convergence appearing in some of the upper bounds is indeed real.
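For concreteness, recall the classical Polyak stepsize for minimizing a convex function $f$ with known optimal value $f^* = \min_x f(x)$:
\[
x_{t+1} \;=\; x_t - \gamma_t \nabla f(x_t),
\qquad
\gamma_t \;=\; \frac{f(x_t) - f^*}{\|\nabla f(x_t)\|^2}.
\]
As an illustrative sketch of the surrogate view (the surrogate $h(x) = \tfrac{1}{2}\big(f(x) - f^*\big)^2$ is chosen here for exposition and is not necessarily the paper's exact construction), note that $\nabla h(x) = \big(f(x) - f^*\big)\nabla f(x)$, so the Polyak update is precisely gradient descent on $h$ with the curvature-adaptive stepsize $1/\|\nabla f(x_t)\|^2$.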