In distributed training, deep neural networks (DNNs) are trained across multiple workers concurrently; in bulk-synchronous parallel (BSP) training, the workers aggregate their local updates at every step. However, BSP does not scale out linearly because of the high communication cost of aggregation. To mitigate this overhead, alternatives such as Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce the synchronization frequency or eliminate synchronization altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that decides dynamically at each step whether to incur communication by calling the aggregation op or to avoid it by applying updates locally, based on the significance of those updates. We propose several optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to an accuracy equal to or better than that of BSP while reducing training time by up to 14$\times$.
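The per-step choice between aggregation and local updates can be illustrated with a minimal, single-process sketch. Everything in it is an assumption made for illustration and is not taken from \texttt{SelSync} itself: the significance measure (relative change in the squared gradient norm between consecutive steps), the threshold `delta`, the rule that all workers synchronize when any one of them sees a significant change, and the helper names `grad_sq_norm`, `relative_change`, `average`, and `selective_step`. Workers are simulated as plain Python lists, and parameter averaging stands in for the aggregation op.

```python
from typing import List, Tuple


def grad_sq_norm(grad: List[float]) -> float:
    """Squared L2 norm of one worker's gradient."""
    return sum(g * g for g in grad)


def relative_change(curr: float, prev: float, eps: float = 1e-12) -> float:
    """Assumed significance measure: relative change between two steps."""
    return abs(curr - prev) / (prev + eps)


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean across workers (stand-in for the aggregation op)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def selective_step(
    params: List[List[float]],
    grads: List[List[float]],
    prev_norms: List[float],
    lr: float = 0.1,
    delta: float = 0.5,
) -> Tuple[List[List[float]], List[float], bool]:
    """One training step over all simulated workers.

    Each worker first applies its own SGD update. If any worker's gradient
    changed by at least `delta` (hypothetical threshold), the updated
    parameters are averaged so that all replicas agree; otherwise the
    local updates are kept and no communication takes place.
    """
    norms = [grad_sq_norm(g) for g in grads]
    synced = any(
        relative_change(c, p) >= delta for c, p in zip(norms, prev_norms)
    )

    # Local SGD update on every worker.
    new_params = [
        [w - lr * g for w, g in zip(p, gr)] for p, gr in zip(params, grads)
    ]

    if synced:
        # Communication step: replace every replica with the average.
        mean = average(new_params)
        new_params = [list(mean) for _ in new_params]

    return new_params, norms, synced


if __name__ == "__main__":
    params = [[1.0, 2.0], [1.0, 2.0]]
    grads = [[1.0, 0.0], [3.0, 0.0]]
    # Gradient norms unchanged since the last step: updates stay local.
    print(selective_step(params, grads, prev_norms=[1.0, 9.0]))
    # Worker 1's gradient norm jumped from 1.0 to 9.0: replicas are averaged.
    print(selective_step(params, grads, prev_norms=[1.0, 1.0]))
```

In a real multi-worker setting the `any(...)` test would itself require a small exchange of flags between workers, and the averaging would be a collective operation; both are collapsed into local function calls here to keep the sketch self-contained.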