We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds as well as its own data distribution. In these algorithms, each worker computes a stochastic gradient on its local data at a possibly stale iterate, that is, a model from some earlier iteration, and returns it to the server without synchronizing with the other workers. We present a unified convergence theory for non-convex smooth functions in this heterogeneous regime. The analysis yields convergence guarantees for pure asynchronous SGD and several of its modifications. Moreover, our theory explains what affects the convergence rate and what can be done to improve the performance of asynchronous algorithms. In particular, we introduce a novel asynchronous method based on worker shuffling. As a by-product of our analysis, we also obtain convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD. The derived rates match the best-known results for those algorithms, highlighting the tightness of our approach. Finally, our numerical evaluations support the theoretical findings and demonstrate good practical performance of our method.
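To make the setting concrete, the following is a minimal toy simulation of pure asynchronous SGD with heterogeneous workers. It is a sketch under stated assumptions, not the method or the experiments of the paper: the one-dimensional quadratic objective, the function name `async_sgd`, and the parameters `targets`, `compute_times`, `lr`, `num_updates`, and `noise_std` are all hypothetical choices made for illustration. Communication time is folded into `compute_times`, and the worker-shuffling method is not implemented here.

```python
import heapq
import random


def async_sgd(targets, compute_times, lr=0.05, num_updates=2000,
              noise_std=0.1, seed=0):
    """Simulate pure asynchronous SGD on f(x) = mean_i 0.5 * (x - targets[i])**2.

    Worker i holds only targets[i] (heterogeneous data) and needs
    compute_times[i] time units per gradient (heterogeneous speed).
    Returns the final iterate and the delay of every applied gradient.
    """
    rng = random.Random(seed)
    x = 0.0      # server model
    step = 0     # number of server updates applied so far
    delays = []
    events = []  # min-heap of (finish_time, worker, read_step, gradient)

    def launch(worker, now):
        # The worker reads the current model; its noisy local gradient
        # reaches the server only after compute_times[worker] time units.
        grad = (x - targets[worker]) + rng.gauss(0.0, noise_std)
        heapq.heappush(events, (now + compute_times[worker], worker, step, grad))

    for w in range(len(targets)):
        launch(w, 0.0)

    while step < num_updates:
        now, worker, read_step, grad = heapq.heappop(events)
        delays.append(step - read_step)  # staleness in server updates
        x -= lr * grad                   # applied at once, no synchronization
        step += 1
        launch(worker, now)              # worker restarts from the fresh model

    return x, delays


if __name__ == "__main__":
    targets = [-1.0, 0.0, 4.0]  # local minimizers; the global minimizer is 1.0
    x_equal, d_equal = async_sgd(targets, [1.0, 1.0, 1.0])
    x_skewed, d_skewed = async_sgd(targets, [0.1, 1.0, 1.0])
    print("equal speeds: x =", x_equal, "max delay =", max(d_equal))
    print("one fast worker: x =", x_skewed, "max delay =", max(d_skewed))
```

In this toy model, equal speeds make the workers take turns, so the iterate settles near the average of the local minimizers. Making one worker ten times faster lets its gradients dominate the updates and pulls the iterate toward that worker's local minimizer. This is one simple way to see why both delays and data heterogeneity matter for the behavior of asynchronous methods.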