Over the last decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, optimizing large models via SGD is still a time-consuming task. To reduce training time, it is common to distribute the training process across multiple devices. Recently, it has been shown that asynchronous SGD (ASGD) always converges faster than mini-batch SGD. However, despite these improvements in the theoretical bounds, most ASGD convergence-rate proofs still rely on a centralized parameter server, which is prone to becoming a bottleneck when scaling out the gradient computations across many distributed processes. In this paper, we present a novel convergence-rate analysis for decentralized and asynchronous SGD (DASGD) which requires neither partial synchronization among nodes nor restrictive network topologies. Specifically, we provide a bound of $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(QS_{avg}\epsilon^{-3/2}) + \mathcal{O}(S_{avg}\epsilon^{-1})$ for the convergence rate of DASGD, where $S_{avg}$ is the average staleness between models, $Q$ is a constant that bounds the norm of the gradients, and $\epsilon$ is a (small) error that is allowed within the bound. Furthermore, when gradients are not bounded, we prove the convergence rate of DASGD to be $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(\sqrt{\hat{S}_{avg}\hat{S}_{max}}\epsilon^{-1})$, with $\hat{S}_{avg}$ and $\hat{S}_{max}$ representing loose versions of the average and maximum staleness, respectively. Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, and $L$-smooth objective function. We anticipate that our results will be of high relevance for the adoption of DASGD by a broad community of researchers and developers.
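To make the relative weight of the three terms in the bounded-gradient bound concrete, the following minimal Python sketch evaluates each term for given values of $\sigma$, $Q$, $S_{avg}$, and $\epsilon$. It is an illustration only: the $\mathcal{O}(\cdot)$ notation hides constants, and setting all of them to 1 is an assumption made here, as are the function and label names, none of which come from the paper.

```python
def dasgd_bound_terms(sigma, q, s_avg, eps):
    """Evaluate the three terms of the bounded-gradient bound.

    Assumption: every constant hidden by the O(.) notation is set to 1,
    so the values are only meaningful relative to each other.
    """
    noise_term = sigma * eps ** -2.0              # O(sigma * eps^-2)
    stale_gradient_term = q * s_avg * eps ** -1.5  # O(Q * S_avg * eps^-3/2)
    stale_term = s_avg * eps ** -1.0               # O(S_avg * eps^-1)
    return noise_term, stale_gradient_term, stale_term


def dominant_term(sigma, q, s_avg, eps):
    """Return a (hypothetical) label for the largest of the three terms."""
    labels = ("noise", "stale_gradient", "stale")
    terms = dasgd_bound_terms(sigma, q, s_avg, eps)
    largest = max(range(len(terms)), key=lambda i: terms[i])
    return labels[largest]


if __name__ == "__main__":
    # Sweep the target error for one illustrative parameter choice.
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        print(eps, dominant_term(sigma=1.0, q=1.0, s_avg=10.0, eps=eps))
```

Under this unit-constant assumption, the first term exceeds the second whenever $\epsilon < (\sigma / (Q S_{avg}))^2$, so for sufficiently small $\epsilon$ the staleness-dependent terms are of lower order than the $\sigma\epsilon^{-2}$ term.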