Wall-clock convergence time and the number of communication rounds are critical performance metrics in distributed learning under the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. ABS has two key enablers. First, the number of workers that the PS waits for in each round before aggregating gradients is adaptively selected to balance straggling against staleness. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
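The two mechanisms named in the abstract can be illustrated with a toy, gradient-free simulation of the PS timeline. This is only a sketch under stated assumptions, not the authors' algorithm: the function name `simulate_bounded_staleness`, the exponential compute-time model, the fixed `staleness_bound`, and the `choose_k` callback are all hypothetical. The abstract does not say how ABS adapts the number of awaited workers, so that rule is left as a pluggable placeholder (a constant by default).

```python
import random


def simulate_bounded_staleness(num_workers=8, choose_k=lambda round_idx: 4,
                               staleness_bound=2, num_rounds=50, seed=0):
    """Toy PS timeline: wait for k gradients per round, restart stale workers.

    Assumptions (not from the paper): exponential compute times, a fixed
    staleness bound, and a user-supplied rule `choose_k` for picking k.
    """
    rng = random.Random(seed)
    # Hypothetical straggler model: each worker has its own mean compute time.
    mean_time = [rng.uniform(1.0, 5.0) for _ in range(num_workers)]

    def draw(w):
        return rng.expovariate(1.0 / mean_time[w])

    version = 0                                  # current global model version
    clock = 0.0                                  # simulated wall-clock time
    start_version = [0] * num_workers            # version each worker computes on
    finish_time = [draw(w) for w in range(num_workers)]
    max_staleness = 0
    restarts = 0

    for round_idx in range(num_rounds):
        # Mechanism 1: the PS waits only for the k earliest-finishing workers.
        k = max(1, min(num_workers, choose_k(round_idx)))
        order = sorted(range(num_workers), key=lambda w: finish_time[w])
        arrived = order[:k]
        clock = max(clock, finish_time[arrived[-1]])

        # Staleness of a gradient = model updates since the worker started.
        for w in arrived:
            max_staleness = max(max_staleness, version - start_version[w])

        version += 1  # the PS aggregates the k gradients and updates the model
        for w in arrived:
            start_version[w] = version
            finish_time[w] = clock + draw(w)

        # Mechanism 2: workers that have fallen too far behind drop their
        # in-progress computation and start over on the latest model.
        for w in range(num_workers):
            if version - start_version[w] > staleness_bound:
                start_version[w] = version
                finish_time[w] = clock + draw(w)
                restarts += 1

    return {"wall_clock": clock, "rounds": num_rounds,
            "max_staleness": max_staleness, "restarts": restarts}


if __name__ == "__main__":
    print(simulate_bounded_staleness())
```

In this sketch, setting `choose_k` to always return `num_workers` recovers a fully synchronous scheme with zero staleness and no restarts, while smaller values trade staleness for shorter rounds. The restart step ensures that no aggregated gradient is more than `staleness_bound` versions old.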