Wall-clock convergence time and the number of communication rounds are critical performance metrics in distributed learning under the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for in each round before aggregating gradients is adaptively selected to balance straggling against staleness. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
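The two mechanisms described in the abstract can be illustrated with a minimal sketch of a single PS round. This is not the authors' algorithm: the `Worker` fields, the `abs_round` function, and the definition of staleness as the gap between the next round index and the model version a worker started from are all assumptions made for illustration. In particular, the abstract does not state how the number of awaited workers is adapted, so `wait_count` is simply passed in as a parameter here.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Worker:
    """Hypothetical worker state, as seen by the parameter server."""
    worker_id: int
    model_version: int   # PS round whose model this worker is computing on
    finish_time: float   # wall-clock time at which its gradient would arrive


def abs_round(
    workers: Sequence[Worker],
    current_round: int,
    wait_count: int,
    staleness_bound: int,
) -> Tuple[List[int], List[int]]:
    """Simulate one round of a bounded-staleness PS (illustrative only).

    The PS waits for the `wait_count` earliest gradients and aggregates them.
    Any worker still computing whose staleness would exceed `staleness_bound`
    once the model advances is told to restart on the fresh model.
    Returns (ids of aggregated workers, ids of restarted workers).
    """
    by_arrival = sorted(workers, key=lambda w: w.finish_time)
    arrived = by_arrival[:wait_count]
    pending = by_arrival[wait_count:]

    aggregated = [w.worker_id for w in arrived]

    # Assumed staleness definition: rounds elapsed since the worker's model.
    next_round = current_round + 1
    restarted = [
        w.worker_id
        for w in pending
        if next_round - w.model_version > staleness_bound
    ]
    return aggregated, restarted


# Example: four workers, the PS waits for two, staleness bound of two rounds.
workers = [
    Worker(0, model_version=5, finish_time=1.0),
    Worker(1, model_version=5, finish_time=2.0),
    Worker(2, model_version=3, finish_time=9.0),  # straggler on an old model
    Worker(3, model_version=5, finish_time=4.0),
]
aggregated, restarted = abs_round(
    workers, current_round=5, wait_count=2, staleness_bound=2
)
print(aggregated, restarted)  # [0, 1] [2]
```

In this sketch, a small `wait_count` shortens each round but leaves more workers computing on old models, while a large `wait_count` approaches synchronous behavior and exposes the round to stragglers. That is the trade-off the adaptive selection in ABS is meant to balance.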