Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

signSGD is popular in nonconvex optimization due to its communication efficiency. Yet existing analyses of signSGD assume that data are sampled with replacement in each iteration, which contradicts practical implementations where data are randomly reshuffled and fed sequentially into the algorithm. We bridge this gap by proving the first convergence result for signSGD with random reshuffling (SignRR) in nonconvex optimization. Given the dataset size $n$, the number of epochs $T$, and the variance bound $\sigma^2$ of a stochastic gradient, we show that SignRR has the same convergence rate $O(\log(nT)/\sqrt{nT} + \|\sigma\|_1)$ as signSGD \citep{bernstein2018signsgd}. We then present SignRVR and SignRVM, which leverage variance-reduced gradients and momentum updates respectively, both converging at $O(\log (nT)/\sqrt{nT} + \log (nT)\sqrt{n}/\sqrt{T})$. In contrast with existing analyses of signSGD, our results require neither an extremely large batch size in each iteration, of the same order as the total number of iterations \citep{bernstein2018signsgd}, nor that the signs of the stochastic and true gradients match element-wise with probability at least 1/2 \citep{safaryan2021stochastic}. We also extend our algorithms to the setting where data are distributed across different machines, yielding dist-SignRVR and dist-SignRVM, both converging at $O(\log (n_0T)/\sqrt{n_0T} + \log (n_0T)\sqrt{n_0}/\sqrt{T})$, where $n_0$ is the dataset size of a single machine. We support our theoretical findings with experiments on simulated and real-world problems, verifying that randomly reshuffled sign methods match or surpass existing baselines.
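
To make the sampling scheme concrete, the following is a minimal sketch of a sign-based update with random reshuffling, not the paper's algorithm listing. The function names (`sign_rr`, `grad_i`), the $1/\sqrt{k}$ step-size schedule, and the toy quadratic objective are all illustrative assumptions. The point it shows is that each epoch draws one fresh permutation of the $n$ component indices and visits every index exactly once, instead of sampling with replacement.

```python
import random


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_rr(grad_i, x0, n, epochs, lr0, seed=0):
    """Sign-based updates with random reshuffling (illustrative sketch).

    grad_i(x, i) returns the gradient of the i-th component function at x
    as a list. The lr0 / sqrt(k) schedule is an assumption for this sketch.
    """
    rng = random.Random(seed)
    x = list(x0)
    k = 0  # global step counter across all epochs
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)  # one fresh permutation per epoch
        for i in perm:  # each data point is used exactly once per epoch
            k += 1
            lr = lr0 / k ** 0.5
            g = grad_i(x, i)
            # Only the element-wise sign of the gradient enters the update.
            x = [xj - lr * sign(gj) for xj, gj in zip(x, g)]
    return x


# Toy problem (assumed): f_i(x) = 0.5 * ||x - a_i||^2 with five targets a_i.
targets = [[0.0, 1.0], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]]


def toy_grad(x, i):
    """Gradient of 0.5 * ||x - a_i||^2, namely x - a_i."""
    return [xj - aj for xj, aj in zip(x, targets[i])]


x_final = sign_rr(toy_grad, x0=[10.0, -10.0], n=len(targets), epochs=200, lr0=0.5)
print(x_final)
```

On this toy problem the iterate moves into the coordinate-wise range spanned by the targets and stays near it. Because only gradient signs are used, it does not settle exactly at a stationary point, which loosely mirrors the non-vanishing $\|\sigma\|_1$ term in the SignRR rate.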