signSGD is popular in nonconvex optimization due to its communication efficiency. Yet, existing analyses typically assume that data are sampled with replacement at each iteration, contradicting the common practical implementation in which data are randomly reshuffled and fed into the algorithm sequentially. This gap leaves the theoretical understanding of the more practical algorithm, signSGD with random reshuffling (SignRR), largely unexplored. We develop the first analysis of SignRR and identify the core technical challenge that prevents a complete convergence analysis of this method. In particular, given a dataset of size $n$ and $T$ epochs, we show that the expected gradient norm of SignRR is upper bounded by $O(\log(nT)/\sqrt{nT} + \sigma)$, where $\sigma$ is the averaged conditional mean squared error, which may not vanish. To overcome this limitation, we develop two new sign-based algorithms under random reshuffling: SignRVR, which incorporates variance-reduced gradients, and SignRVM, which integrates momentum-based updates. Both algorithms achieve a faster convergence rate of $O(\log(nT)/\sqrt{nT} + \log(nT)\sqrt{n}/\sqrt{T})$. We further extend our algorithms to a distributed setting, where they attain a convergence rate of $O(\log(n_0T)/\sqrt{n_0T} + \log(n_0T)\sqrt{n_0}/\sqrt{T})$, with $n_0$ the size of the dataset on a single machine. These results mark a first step toward the theoretical understanding of practical implementations of sign-based optimization algorithms. Finally, we back up our theoretical findings with experiments on simulated and real-world problems, verifying that randomly reshuffled sign-based methods match or surpass existing baselines.
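To make the sampling distinction concrete, the following is a minimal sketch of the SignRR update described above: reshuffle the data once per epoch, then take sign-of-gradient steps sequentially through the permutation, rather than sampling with replacement at every step. The function names (`sign_rr`, `grad_f`) and the least-squares usage example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sign_rr(x0, grad_f, n, epochs, lr):
    """Minimal SignRR sketch: signSGD with random reshuffling.

    x0      : initial parameter vector (numpy array)
    grad_f  : grad_f(x, i) returns the gradient of the i-th component loss
    n       : dataset size
    epochs  : number of epochs T
    lr      : step size
    """
    x = x0.copy()
    for _ in range(epochs):
        # Reshuffle once per epoch, then sweep the permutation sequentially
        # (instead of drawing an index with replacement at each step).
        perm = np.random.permutation(n)
        for i in perm:
            x -= lr * np.sign(grad_f(x, i))
    return x

# Illustrative usage on least-squares components f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
x = sign_rr(np.zeros(5), lambda x, i: (A[i] @ x - b[i]) * A[i],
            n=100, epochs=50, lr=1e-3)
```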