Perfect synchronization in distributed machine learning is inefficient, or even impossible, because of latency, packet losses, and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), in which each device performs local computation and communication at its own pace without any form of synchronization. Unlike existing asynchronous distributed algorithms, R-FAST eliminates the impact of data heterogeneity across devices and tolerates packet losses by employing a robust gradient tracking strategy, which relies on properly designed auxiliary variables that track and buffer the overall gradient vector. More importantly, the method communicates over two spanning-tree graphs and requires only that they share at least one common root, which allows flexible designs of the communication architecture. We show that R-FAST converges in expectation to a neighborhood of the optimum at a geometric rate for smooth and strongly convex objectives, and to a stationary point at a sublinear rate in general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5 to 2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while achieving comparable accuracy, and that it outperforms existing asynchronous SOTA algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
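To make the role of gradient tracking concrete, the sketch below runs the textbook synchronous gradient tracking iteration on a toy problem. It is not R-FAST: the asynchronous updates, the buffering variables, and the two spanning-tree graphs described above are not reproduced. The quadratic local losses, the 4-node ring, the mixing matrix `W`, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_tracking(b, W, alpha=0.1, iters=300):
    """Synchronous gradient tracking on toy losses f_i(x) = 0.5 * (x - b[i])**2.

    Illustrative only: this is the standard synchronous scheme, not R-FAST.
    """
    grad = lambda x: x - b      # stacked local gradients, one entry per node
    x = np.zeros_like(b)        # local model held by each node
    y = grad(x)                 # each tracker starts at its local gradient
    for _ in range(iters):
        x_next = W @ x - alpha * y             # mix with neighbors, step along the tracker
        y = W @ y + grad(x_next) - grad(x)     # mix trackers, add the local gradient change
        x = x_next
    return x

# Hypothetical 4-node ring with a doubly stochastic mixing matrix.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
b = np.array([1.0, -2.0, 4.0, 9.0])   # heterogeneous local minimizers
x_final = gradient_tracking(b, W)
print(x_final, b.mean())
```

Although each node's local minimizer `b[i]` differs, every entry of `x_final` approaches the minimizer of the average loss, `b.mean()`, because the tracker `y` follows the network-wide average gradient rather than the local one. This is the sense in which tracking removes the effect of data heterogeneity in the synchronous, lossless case.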