Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGD in the (strongly) convex case is well understood, its convergence for nonconvex functions rests on weaker mathematical foundations. Most existing studies on the nonconvex convergence of SGD state their complexity results in terms of either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural properties), taken over the entire range of iterates. Hence the last iterates of SGD do not necessarily enjoy the same complexity guarantee. This paper shows that, given a large enough total iteration budget $T$, an $\epsilon$-stationary point exists in the final iterates of SGD, not just somewhere in the entire range of iterates, which is a much stronger result than the existing one. Additionally, our analyses allow us to measure the density of the $\epsilon$-stationary points in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the objective function and the bounds on the stochastic gradient. As a result of our analyses, we address certain myths and legends related to the nonconvex convergence of SGD and pose some thought-provoking questions that could set new directions for research.
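To make the notion of "density of $\epsilon$-stationary points in the final iterates" concrete, the following is a minimal sketch and not the paper's experimental setup. It runs SGD on a toy one-dimensional nonconvex objective and reports the fraction of the last `tail` iterates whose true gradient magnitude is at most `eps`. The objective, the constant step size $1/\sqrt{T}$, the Gaussian gradient noise, and the names `grad_f` and `sgd_tail_density` are all assumptions chosen for illustration. Using the exact gradient norm at each iterate is a simplification of the expectation-based quantities the abstract refers to.

```python
import math
import random


def grad_f(x):
    """Gradient of the toy nonconvex objective f(x) = x**2 + 3*sin(x)**2 (assumed example)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def sgd_tail_density(T, tail, eps, noise_std=1.0, x0=3.0, seed=0):
    """Run T SGD steps and return the fraction of the last `tail` iterates
    whose true gradient magnitude is at most eps."""
    rng = random.Random(seed)      # local generator, so runs are reproducible
    step = 1.0 / math.sqrt(T)      # constant step size ~ 1/sqrt(T) (assumed choice)
    x = x0
    tail_norms = []
    for t in range(T):
        if t >= T - tail:
            # Record the exact gradient magnitude at the current iterate.
            tail_norms.append(abs(grad_f(x)))
        # Stochastic gradient: exact gradient plus zero-mean Gaussian noise.
        g = grad_f(x) + rng.gauss(0.0, noise_std)
        x -= step * g
    hits = sum(1 for n in tail_norms if n <= eps)
    return hits / len(tail_norms)


if __name__ == "__main__":
    for T in (100, 1_000, 10_000):
        density = sgd_tail_density(T=T, tail=T // 10, eps=0.5)
        print(f"T={T:>6}  density of eps-stationary points in last 10%: {density:.2f}")
```

Varying `T`, `eps`, and `noise_std` gives a rough empirical feel for how concentrated the $\epsilon$-stationary points are in the tail of a run. It does not verify any rate.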