Top-k sparsification is widely used to reduce communication volume in distributed deep learning. However, its performance remains limited by the Sparse Gradient Accumulation (SGA) dilemma. A few methods have recently been proposed to handle the SGA dilemma, but even the state-of-the-art method has several drawbacks: for example, it relies on an inefficient communication algorithm and requires extra transmission steps. Motivated by these limitations, we propose SparDL, a novel and efficient sparse communication framework. Specifically, SparDL uses the Spar-Reduce-Scatter algorithm, which is based on an efficient Reduce-Scatter model, to handle the SGA dilemma without additional communication operations. To further reduce the latency cost and improve the efficiency of SparDL, we propose the Spar-All-Gather algorithm. We also propose a global residual collection algorithm to ensure fast convergence of model training. Finally, extensive experiments validate the superiority of SparDL.
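For readers unfamiliar with the setting, the following is a minimal, generic sketch of top-k sparsification with a local residual. It is not the Spar-Reduce-Scatter, Spar-All-Gather, or global residual collection algorithm. The function names, the two hypothetical workers, and the gradient values are assumptions chosen for illustration. The abstract does not define the SGA dilemma; the sketch only shows that summing sparse gradients with different index sets yields a denser result, which is our reading of it.

```python
from typing import Dict, List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[Dict[int, float], List[float]]:
    """Keep the k largest-magnitude entries of grad; return (sparse, residual)."""
    if not 0 < k <= len(grad):
        raise ValueError("k must be in 1..len(grad)")
    # Indices of the k entries with the largest absolute value.
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    sparse = {i: grad[i] for i in keep}
    # Entries that were not selected stay behind as the local residual.
    residual = [0.0 if i in sparse else g for i, g in enumerate(grad)]
    return sparse, residual


def sparse_sum(parts: List[Dict[int, float]]) -> Dict[int, float]:
    """Sum sparse gradients (index -> value) coming from several workers."""
    total: Dict[int, float] = {}
    for part in parts:
        for i, v in part.items():
            total[i] = total.get(i, 0.0) + v
    return total


# Two hypothetical workers, each keeping k = 2 of 6 gradient entries.
g0 = [0.9, -0.1, 0.05, -0.7, 0.2, 0.0]
g1 = [0.1, 0.8, -0.6, 0.05, 0.0, 0.3]
s0, r0 = top_k_sparsify(g0, 2)
s1, r1 = top_k_sparsify(g1, 2)

# The workers selected disjoint indices, so the sum holds 4 non-zeros, not 2.
merged = sparse_sum([s0, s1])
```

In this toy run each worker sends two values, yet the summed gradient has four non-zero entries, so the communicated volume grows as sparse gradients are combined. The unselected entries held in `r0` and `r1` are what a residual scheme would add back to later gradients.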