High Probability Guarantees for Random Reshuffling

We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for solving smooth nonconvex optimization problems. $\mathsf{RR}$ is widely used in practice, notably in training neural networks. In this work, we first investigate the concentration property of $\mathsf{RR}$'s sampling procedure and establish a new high probability sample complexity guarantee for driving the gradient norm (without taking expectation) below $\varepsilon$, which effectively characterizes the efficiency of a single $\mathsf{RR}$ execution. Our derived complexity matches the best existing in-expectation complexity up to a logarithmic term, without imposing additional assumptions or changing $\mathsf{RR}$'s update rule. Furthermore, by leveraging our derived high probability descent property and bound on the stochastic error, we propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, at which point $\mathsf{RR}$-$\mathsf{sc}$ returns an iterate whose gradient norm is below $\varepsilon$ with high probability. Moreover, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that applies an additional randomized perturbation near stationary points. We show that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and efficiently returns a second-order stationary point with high probability, without making any sub-Gaussian tail-type assumptions on the stochastic gradient errors. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
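To make the setting concrete, the following is a minimal sketch of plain $\mathsf{RR}$ on a finite-sum objective, assuming NumPy and a toy least-squares problem. The function `random_reshuffling`, its parameters, and the test problem are hypothetical names chosen for illustration. The stopping test shown here (the norm of the averaged component gradients accumulated over one epoch) is only an illustrative stand-in for a computable criterion; it is not claimed to be the $\mathsf{RR}$-$\mathsf{sc}$ rule, and the perturbation step of $\mathsf{p}$-$\mathsf{RR}$ is not included.

```python
import numpy as np


def random_reshuffling(grad_i, x0, n, step_size, max_epochs, tol, seed=0):
    """Plain random reshuffling with an illustrative epoch-level stopping test.

    grad_i(x, i) returns the gradient of the i-th component function at x.
    The stopping test is an assumption made for this sketch only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for epoch in range(max_epochs):
        perm = rng.permutation(n)  # fresh permutation: each index used once per epoch
        x_start = x.copy()
        for i in perm:
            x = x - step_size * grad_i(x, i)
        # Average of the component gradients evaluated along this epoch;
        # it is computable from the iterates alone, without a full gradient pass.
        avg_grad = (x_start - x) / (n * step_size)
        if np.linalg.norm(avg_grad) <= tol:
            return x, epoch + 1
    return x, max_epochs


# Toy problem (assumption): f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
# with a consistent linear system, so the minimizer x_true is known.
data_rng = np.random.default_rng(1)
n, d = 50, 5
A = data_rng.standard_normal((n, d))
x_true = data_rng.standard_normal(d)
b = A @ x_true


def grad_i(x, i):
    # Gradient of the i-th least-squares component.
    return A[i] * (A[i] @ x - b[i])


step = 1.0 / np.max(np.sum(A**2, axis=1))  # conservative step for this toy problem
x_hat, epochs_used = random_reshuffling(
    grad_i, np.zeros(d), n, step_size=step, max_epochs=500, tol=1e-8
)
```

Each epoch draws a new permutation and visits every component exactly once, which is what distinguishes $\mathsf{RR}$ from sampling with replacement. The step size and tolerance above are choices made for this toy problem, not values taken from the analysis.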
