Wall-clock convergence time and the number of communication rounds are critical performance metrics in distributed learning under the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to achieve a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for in each round of gradient aggregation is adaptively selected to strike a balance between straggling and staleness. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results are provided to demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
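The two mechanisms described in the abstract can be illustrated with a small timing simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `Worker` and `abs_round`, the definition of staleness as the number of model updates a worker lags behind, and the fixed per-worker compute times are all hypothetical. The abstract does not give the rule for adaptively choosing the number of awaited workers, so `k` is simply passed in for each round.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Worker:
    finish_time: float  # wall-clock time at which the current computation ends
    model_version: int  # model version the current computation started from


def abs_round(
    workers: List[Worker],
    version: int,
    k: int,
    staleness_bound: int,
    compute_time: Callable[[int], float],
) -> Tuple[float, int, List[int], List[int]]:
    """Simulate one PS round (hypothetical sketch).

    The PS waits for the k earliest-finishing workers, aggregates their
    gradients, and then forces every remaining worker whose staleness
    exceeds staleness_bound to restart from the new model.
    """
    order = sorted(range(len(workers)), key=lambda i: workers[i].finish_time)
    arrived = order[:k]
    round_end = workers[arrived[-1]].finish_time
    # Staleness of each aggregated gradient: updates missed since it started.
    staleness = [version - workers[i].model_version for i in arrived]
    new_version = version + 1

    restarted = []
    for i in range(len(workers)):
        if i in arrived:
            # Workers that reported pull the new model and start again.
            workers[i] = Worker(round_end + compute_time(i), new_version)
        elif new_version - workers[i].model_version > staleness_bound:
            # Too stale: abandon the current computation and restart.
            workers[i] = Worker(round_end + compute_time(i), new_version)
            restarted.append(i)
    return round_end, new_version, staleness, restarted


# Illustrative run: four workers, the last one being a straggler.
times = [1.0, 2.0, 3.0, 10.0]
workers = [Worker(t, 0) for t in times]
version = 0
for _ in range(2):
    end, version, stale, restarted = abs_round(
        workers, version, k=2, staleness_bound=1, compute_time=lambda i: times[i]
    )
    print(f"t={end}: staleness={stale}, restarted={restarted}")
```

In this toy run the straggler is never waited for, and it is restarted once it falls more than one update behind. A larger `k` would lower staleness at the cost of longer rounds, which is the trade-off the adaptive selection is meant to balance.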