Perfect synchronization in distributed machine learning is inefficient and can even be impossible due to latency, packet losses, and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), in which each device performs local computation and communication at its own pace without any form of synchronization. Unlike existing asynchronous distributed algorithms, R-FAST eliminates the impact of data heterogeneity across devices and tolerates packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method uses two spanning-tree graphs for communication, requiring only that they share at least one common root, which enables flexible designs of the communication architecture. We show that R-FAST converges in expectation to a neighborhood of the optimum at a geometric rate for smooth and strongly convex objectives, and to a stationary point at a sublinear rate in general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and that it outperforms existing state-of-the-art asynchronous algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
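To make the gradient-tracking mechanism concrete, the sketch below shows the standard synchronous push-pull gradient-tracking update over two mixing matrices: a row-stochastic matrix R (pull graph) and a column-stochastic matrix C (push graph), whose underlying graphs each contain a spanning tree sharing a common root, exactly the graph condition the abstract describes. This is a minimal illustrative baseline, not the asynchronous R-FAST protocol itself; the matrices R and C, the step size alpha, and the local quadratic losses are assumptions chosen for the example, and R-FAST additionally introduces buffering auxiliary variables and removes all synchronization.

```python
# Minimal synchronous sketch of push-pull-style gradient tracking, the
# building block that R-FAST extends to the fully asynchronous setting.
# All concrete choices (ring graph, alpha, quadratic losses) are
# illustrative assumptions, not the paper's actual setup.
import numpy as np

n, d, alpha, steps = 4, 2, 0.1, 200

# Each device i holds a local quadratic f_i(x) = 0.5 * ||x - b_i||^2,
# so the global optimum is the average of the b_i (heterogeneous data).
rng = np.random.default_rng(0)
b = rng.normal(size=(n, d))

def grad(x):
    # Stacked local gradients: row i is the gradient of f_i at x_i.
    return x - b

# A directed ring with self-loops contains a spanning tree and gives a
# common root; normalize rows for R (pull) and columns for C (push).
A = np.eye(n) + np.roll(np.eye(n), 1, axis=0)
R = A / A.sum(axis=1, keepdims=True)  # row-stochastic
C = A / A.sum(axis=0, keepdims=True)  # column-stochastic

x = np.zeros((n, d))
y = grad(x)          # tracker initialized at the local gradients
g_old = grad(x)
for _ in range(steps):
    x = R @ (x - alpha * y)        # pull: consensus + descent step
    g_new = grad(x)
    y = C @ y + g_new - g_old      # push: track the average gradient
    g_old = g_new

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - b.mean(axis=0)))
```

Because C is column-stochastic, the sum of the tracker rows always equals the sum of the current local gradients, which is what lets gradient tracking cancel data heterogeneity across devices; the robustness claims in the abstract concern preserving this property under asynchrony and packet losses.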