Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions capture practical machine learning problems more closely than the pervasive $L_0$-smoothness assumption does. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$, which is $(0,\mathcal{O}(L_1))$-smooth. Despite this richness, an emerging line of work achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings it either fails to hold or yields convergence rates with a weaker dependence on the noise scale. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter satisfies $\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability, even for smooth and strongly convex functions, when $\sigma_1 > 1$.
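
For concreteness, the following sketch spells out the two conditions referred to above. The first display is the $(L_0,L_1)$-smoothness condition in the form given by Zhang et al. for twice-differentiable $f$. The second is an assumption on our part: the abstract does not define the multiplicative noise parameter $\sigma_1$, so we write the affine-variance form that the term usually denotes, with a hypothetical additive parameter $\sigma_0$ and stochastic gradient $g(x)$ introduced only for illustration.

$$
\|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x,
\qquad\qquad
\mathbb{E}\big[\|g(x) - \nabla f(x)\|^2\big] \;\le\; \sigma_0^2 + \sigma_1^2 \|\nabla f(x)\|^2 .
$$

As a check of the example in the text, take $f(x) = \exp(L_1 x)$ with $L_1 > 0$. Then

$$
f'(x) = L_1 e^{L_1 x}, \qquad f''(x) = L_1^2 e^{L_1 x} = L_1\,|f'(x)|,
$$

so the first condition holds with $L_0 = 0$, while $f''$ is unbounded on $\mathbb{R}$ and no finite uniform smoothness constant exists.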
