Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

signSGD is popular in nonconvex optimization due to its communication efficiency. Yet existing analyses of signSGD assume that data are sampled with replacement in each iteration, which contradicts practical implementations, where data are randomly reshuffled and fed sequentially into the algorithm. We bridge this gap by proving the first convergence result for signSGD with random reshuffling (SignRR) in nonconvex optimization. Given the dataset size $n$, the number of epochs $T$, and the variance bound $\sigma^2$ of a stochastic gradient, we show that SignRR has the same convergence rate, $O(\log(nT)/\sqrt{nT} + \|\sigma\|_1)$, as signSGD \citep{bernstein2018signsgd}. We then present SignRVR and SignRVM, which leverage variance-reduced gradients and momentum updates, respectively, and both converge at $O(\log(nT)/\sqrt{nT})$. In contrast with prior analyses of signSGD, our results require neither an extremely large per-iteration batch size of the same order as the total number of iterations \citep{bernstein2018signsgd} nor that the signs of the stochastic and true gradients match element-wise with probability at least 1/2 \citep{safaryan2021stochastic}. We also extend our algorithms to the setting where data are distributed across different machines, yielding dist-SignRVR and dist-SignRVM, both converging at $O(\log(n_0T)/\sqrt{n_0T})$, where $n_0$ is the dataset size of a single machine. We support our theoretical findings with experiments on simulated and real-world problems, which verify that randomly reshuffled sign methods match or surpass existing baselines.
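To make the sampling scheme concrete, the following is a minimal sketch of a sign-based update with random reshuffling: each epoch draws one fresh permutation of the data, and every sample is visited exactly once in that order. It is an illustration under stated assumptions, not the authors' implementation. The function name `sign_rr`, the per-sample gradient interface, the toy objective, and the $1/\sqrt{k}$ step-size decay are all hypothetical choices made for this example.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_rr(grads, x0, epochs, lr, seed=0):
    """Sign-based updates with random reshuffling (illustrative sketch).

    grads  : list of n functions; grads[i](x) returns the gradient of the
             i-th component function at x (a list of floats).
    x0     : initial point (list of floats).
    epochs : number of passes over the data (T in the text).
    lr     : base step size; the 1/sqrt(k) decay below is an assumption.
    """
    rng = random.Random(seed)
    x = list(x0)
    n = len(grads)
    k = 0  # global iteration counter across all epochs
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # one permutation per epoch: no replacement
        for i in order:
            k += 1
            g = grads[i](x)
            eta = lr / k ** 0.5
            # Only the sign of each gradient coordinate is used.
            x = [xj - eta * sign(gj) for xj, gj in zip(x, g)]
    return x


# Toy finite-sum problem (assumed for illustration):
# f_i(x) = 0.5 * (x - c_i)^2, so grad f_i(x) = x - c_i.
centers = [1.0, 2.0, 3.0]
grads = [lambda x, c=c: [x[0] - c] for c in centers]
x_final = sign_rr(grads, x0=[0.0], epochs=200, lr=0.5)
print(x_final)
```

On this toy problem the iterate settles near 2, the point where the component gradient signs balance. Replacing `rng.shuffle(order)` with independent draws of `i` would give the with-replacement sampling that earlier analyses assume.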
