Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. How to accelerate asynchronous SGD by efficiently scheduling multiple workers is therefore an important issue. In this paper, we present a unified framework to analyze and optimize the convergence of asynchronous SGD, based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we characterize the run time and staleness of distributed SGD without assuming that the computation times are memoryless. Given the learning rate, we express the SDDE's damping coefficient and its delay statistics as functions of the number of activated clients, the staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE yields both the convergence condition and the convergence speed of distributed SGD through its characteristic roots, which in turn allows the scheduling policies for asynchronous/event-triggered SGD to be optimized. Interestingly, we show that increasing the number of activated workers does not necessarily accelerate distributed SGD, because of staleness. Moreover, a small degree of staleness does not necessarily slow down convergence, while a large degree of staleness causes distributed SGD to diverge. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
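
The effect of staleness described above can be illustrated with a minimal sketch. It is not the paper's SDDE model: it assumes a noise-free, one-dimensional quadratic objective f(x) = curvature * x^2 / 2 (one eigen-direction of the Hessian) and a fixed, deterministic staleness, so the update is x[k+1] = x[k] - learning_rate * curvature * x[k - staleness]. The function name `stale_gradient_descent` and all parameter values are hypothetical choices for illustration.

```python
from collections import deque


def stale_gradient_descent(curvature, learning_rate, staleness, steps, x0=1.0):
    """Iterate x[k+1] = x[k] - learning_rate * curvature * x[k - staleness].

    Returns the largest |x| over the final (staleness + 1) iterates, used
    here as the distance to the minimizer x* = 0 of the assumed quadratic.
    """
    # Sliding window of the last (staleness + 1) iterates, oldest first.
    # Assumption: the iterate sits at x0 before the first update arrives.
    history = deque([x0] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        stale_gradient = curvature * history[0]  # gradient at x[k - staleness]
        # Appending pushes out the oldest iterate because of maxlen.
        history.append(history[-1] - learning_rate * stale_gradient)
    return max(abs(x) for x in history)


# Illustrative values only: unit curvature and learning rate 0.1.
for staleness in (0, 2, 30):
    error = stale_gradient_descent(1.0, 0.1, staleness, steps=2000)
    print(f"staleness={staleness:2d}  final error={error:.3e}")
```

In this simplified setting the characteristic polynomial of the recursion is z^(d+1) - z^d + learning_rate * curvature = 0 for staleness d, and the iterates shrink only when every root lies inside the unit circle. With the values above, a staleness of 2 ends closer to the minimizer than the delay-free run, while a staleness of 30 makes the iterates grow, which mirrors the qualitative claims about small and large staleness.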
