We propose a novel analysis of the Decentralized Stochastic Gradient Descent (DSGD) algorithm with constant step size, interpreting the iterates of the algorithm as a Markov chain. We show that DSGD converges to a stationary distribution whose bias decomposes, to first order, into two components: one due to decentralization, growing with the clients' heterogeneity and with the inverse of the graph's spectral gap, and one due to stochasticity. Remarkably, the variance of the local parameters is, to first order, inversely proportional to the number of clients, regardless of the network topology and even when the clients' iterates are not averaged at the end. As a consequence of our analysis, we obtain non-asymptotic convergence bounds on the clients' local iterates, confirming that DSGD achieves a linear speed-up in the number of clients and that the network topology only affects higher-order terms.
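For concreteness, the following is a sketch of the DSGD update underlying this analysis, written in standard notation rather than the paper's own (the exact formulation may differ): each client $i \in \{1,\dots,n\}$ holds local parameters $x_i^{(t)}$, $W$ is a doubly stochastic gossip matrix supported on the communication graph, $\gamma$ is the constant step size, and $\nabla F_j\bigl(\cdot;\xi_j^{(t)}\bigr)$ denotes a stochastic gradient of client $j$'s local objective.
\[
x_i^{(t+1)} \;=\; \sum_{j=1}^{n} W_{ij}\,\Bigl(x_j^{(t)} - \gamma\, \nabla F_j\bigl(x_j^{(t)};\, \xi_j^{(t)}\bigr)\Bigr), \qquad i = 1, \dots, n.
\]
Since $\gamma$ is constant and the noise variables $\xi_j^{(t)}$ are drawn independently across iterations, the stacked iterate $\bigl(x_1^{(t)}, \dots, x_n^{(t)}\bigr)$ forms a time-homogeneous Markov chain, which is the viewpoint under which the stationary distribution, its bias, and its variance are studied.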