We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for smooth nonconvex optimization problems. $\mathsf{RR}$ is widely used in practice, notably in training neural networks. In this work, we first investigate the concentration property of $\mathsf{RR}$'s sampling procedure and establish a new high-probability sample complexity guarantee for driving the gradient norm (without expectation) below $\varepsilon$, which effectively characterizes the efficiency of a single $\mathsf{RR}$ execution. The derived complexity matches the best existing in-expectation bound up to a logarithmic term, without imposing additional assumptions or changing $\mathsf{RR}$'s update rule. Furthermore, by leveraging the derived high-probability descent property and the bound on the stochastic error, we propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted by $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, at which point $\mathsf{RR}$-$\mathsf{sc}$ returns an iterate whose gradient norm is below $\varepsilon$ with high probability. Moreover, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that adds a randomized perturbation procedure near stationary points. We show that $\mathsf{p}$-$\mathsf{RR}$ escapes strict saddle points and efficiently returns a second-order stationary point with high probability, without making any sub-Gaussian tail-type assumptions on the stochastic gradient errors. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
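For readers unfamiliar with the sampling scheme, the following is a minimal sketch of the basic $\mathsf{RR}$ update only: in each epoch a fresh permutation of the component indices is drawn, and one gradient step is taken per component in that order. It does not implement the stopping criterion of $\mathsf{RR}$-$\mathsf{sc}$ or the perturbation step of $\mathsf{p}$-$\mathsf{RR}$. The function names, the constant step size, and the toy nonconvex objective with components $f_i(x) = (x^2-1)^2/4 + a_i x$ are illustrative assumptions, not taken from the paper.

```python
import random


def random_reshuffling(grads, x0, step_size, epochs, seed=0):
    """Basic RR: every epoch visits each component gradient exactly once,
    in a freshly sampled random order (sampling without replacement)."""
    rng = random.Random(seed)
    x = list(x0)
    n = len(grads)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # new permutation for this epoch
        for i in order:
            g = grads[i](x)
            x = [xj - step_size * gj for xj, gj in zip(x, g)]
    return x


def full_gradient_norm(grads, x):
    """Euclidean norm of the averaged gradient (1/n) * sum_i grad f_i(x)."""
    n = len(grads)
    total = [0.0] * len(x)
    for grad in grads:
        total = [t + gj / n for t, gj in zip(total, grad(x))]
    return sum(t * t for t in total) ** 0.5


def make_grad(a):
    """Gradient of the hypothetical component f_i(x) = (x^2 - 1)^2 / 4 + a * x."""
    return lambda x: [x[0] ** 3 - x[0] + a]


# Toy problem (assumption): the offsets sum to zero, so the averaged
# objective is (x^2 - 1)^2 / 4, which is nonconvex with minimizers at +1 and -1.
offsets = [-0.5, -0.1, 0.2, 0.4]
component_grads = [make_grad(a) for a in offsets]

x_final = random_reshuffling(component_grads, x0=[2.0], step_size=0.01, epochs=200)
final_norm = full_gradient_norm(component_grads, x_final)
```

In this toy run the full gradient norm is evaluated directly to check the final iterate. That is affordable only because the problem is tiny, and it is not the computable criterion the abstract refers to.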