Convergence in high probability (HP) has attracted increasing interest, as it implies exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, such as uniformly bounded gradients or asymptotically vanishing noise. This creates a significant gap between the assumptions used to establish convergence in the HP and mean-squared error (MSE) senses, and stands in contrast to centralized settings, where $\mathtt{SGD}$ is known to converge in HP under the same conditions on the cost function as those needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to transient times that match, or are strictly better than, those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work to show that $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the moment generating function (MGF) of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.
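For concreteness, a minimal sketch of the $\mathtt{DSGD}$ recursion in its standard form follows; the notation here ($W$ for the mixing matrix, $\alpha_t$ for the step size, $g_i$ for user $i$'s stochastic gradient, $n$ users) is illustrative and reflects common usage rather than this paper's own definitions.
\[
x_i^{t+1} \;=\; \sum_{j=1}^{n} W_{ij}\, x_j^{t} \;-\; \alpha_t\, g_i\!\left(x_i^{t}\right), \qquad i = 1, \dots, n,
\]
where $W$ is a doubly stochastic mixing matrix respecting the communication graph, $\alpha_t$ is the step size, and $g_i(x_i^t)$ is an unbiased stochastic gradient of user $i$'s local cost evaluated at its current iterate.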