In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. Unresponsive or slow workers, called stragglers, can otherwise degrade the benefit of outsourcing the computation. The properties of optimization procedures such as stochastic gradient descent (SGD) can be leveraged to mitigate their effect: at each iteration of the algorithm, the central node waits only for a subset of the workers to finish their computation. Previous works proposed adapting the number of workers to wait for as the algorithm evolves in order to optimize the speed of convergence. In contrast, we model the communication and computation times using independent random variables. Based on this model, we construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm. As a result, we improve the convergence speed of distributed SGD while significantly reducing the computation load, at the expense of a slight increase in communication load.
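The idea of waiting only for the fastest workers can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the scheme proposed in this work: response times are assumed to be the sum of independent exponential computation and communication times, the learning problem is a toy one-dimensional regression, and the schedule for the number of workers `k` and the per-worker `load` is an arbitrary placeholder chosen for illustration. All function names and parameters (`sample_response_times`, `fastest_k`, `run_sgd`, `comp_rate`, `comm_rate`) are hypothetical.

```python
import random


def sample_response_times(n, load_fraction, rng, comp_rate=1.0, comm_rate=2.0):
    """Per-worker response time: computation (scaled by load) plus communication.

    Assumption: both parts are independent exponential random variables.
    """
    return [load_fraction * rng.expovariate(comp_rate) + rng.expovariate(comm_rate)
            for _ in range(n)]


def fastest_k(times, k):
    """Return the indices of the k fastest workers and the time spent waiting."""
    order = sorted(range(len(times)), key=times.__getitem__)
    chosen = order[:k]
    return chosen, times[chosen[-1]]


def run_sgd(n=10, iters=200, lr=0.1, shard_size=20, seed=0):
    """Simulate fastest-k distributed SGD on y = true_w * x + noise."""
    rng = random.Random(seed)
    true_w = 3.0
    # Each worker holds its own shard of (x, y) samples.
    shards = []
    for _ in range(n):
        xs = [rng.uniform(-1, 1) for _ in range(shard_size)]
        shards.append([(x, true_w * x + rng.gauss(0, 0.1)) for x in xs])

    w, clock = 0.0, 0.0
    for t in range(iters):
        # Placeholder schedule (an assumption, not the paper's rule):
        # k grows and the per-worker load shrinks as the run progresses.
        k = min(n, 2 + (t * n) // iters)
        load = max(5, shard_size - (15 * t) // iters)

        times = sample_response_times(n, load / shard_size, rng)
        chosen, waited = fastest_k(times, k)
        clock += waited  # the iteration ends when the k-th fastest worker replies

        # Average the mini-batch gradients of the squared error over the k workers.
        grad = 0.0
        for i in chosen:
            batch = rng.sample(shards[i], load)
            grad += sum(2 * (w * x - y) * x for x, y in batch) / load
        w -= lr * grad / k
    return w, clock


if __name__ == "__main__":
    w, clock = run_sgd()
    print(f"estimated w = {w:.3f}, simulated wall-clock time = {clock:.1f}")
```

In this sketch the simulated wall-clock time of an iteration is the response time of the k-th fastest worker, so smaller `k` and smaller `load` shorten each iteration at the cost of a noisier gradient estimate. That trade-off is what an adaptive scheme would tune over the course of the run.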