In distributed training, deep neural networks (DNNs) are launched on multiple workers concurrently, and in bulk-synchronous parallel (BSP) training the workers aggregate their local updates at every step. However, BSP does not scale out linearly because of the high communication cost of aggregation. To mitigate this overhead, alternatives such as Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce the synchronization frequency or eliminate synchronization altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that decides dynamically at each step, based on the significance of the local updates, whether to incur communication by calling the aggregation op or to avoid it by applying the updates locally. We propose several optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to an accuracy equal to or better than that of BSP while reducing training time by up to 14$\times$.
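The per-step choice between aggregation and local updates can be illustrated with a minimal single-process sketch. The abstract does not say how the significance of an update is measured, so the metric used here (relative change in the gradient norm between consecutive steps), the threshold `delta`, and the names `relative_change` and `selective_step` are all assumptions made for illustration, not the authors' implementation. Averaging over a Python list stands in for a real collective aggregation op.

```python
from typing import List, Tuple

Vector = List[float]


def l2_norm(vec: Vector) -> float:
    """Euclidean norm of a flat gradient vector."""
    return sum(x * x for x in vec) ** 0.5


def relative_change(curr: float, prev: float, eps: float = 1e-12) -> float:
    """Assumed significance metric: relative change in gradient norm."""
    return abs(curr - prev) / (abs(prev) + eps)


def selective_step(
    params: List[Vector],      # one parameter vector per simulated worker
    grads: List[Vector],       # one gradient vector per simulated worker
    prev_norms: List[float],   # gradient norms from the previous step
    lr: float,
    delta: float,              # hypothetical significance threshold
) -> Tuple[List[Vector], List[float], bool]:
    norms = [l2_norm(g) for g in grads]

    # Synchronize if any worker reports a significant change. In a real
    # deployment the workers would have to agree on this flag collectively.
    synced = any(
        relative_change(n, p) >= delta for n, p in zip(norms, prev_norms)
    )

    if synced:
        # Simulated aggregation: every worker applies the averaged gradient.
        avg = [sum(col) / len(grads) for col in zip(*grads)]
        applied = [avg for _ in grads]
    else:
        # No communication: each worker applies only its own gradient.
        applied = grads

    new_params = [
        [w - lr * g for w, g in zip(p, a)] for p, a in zip(params, applied)
    ]
    return new_params, norms, synced
```

With a small `delta` the sketch synchronizes on almost every step and behaves like BSP; with a large `delta` it behaves like purely local training. A semi-synchronous scheme sits between those two extremes.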