Top-$k$ sparsification has recently been widely used to reduce communication volume in distributed deep learning; however, its performance is still limited by the Gradient Accumulation (GA) dilemma. Several methods have been proposed to handle the GA dilemma, but they have two drawbacks: (1) they incur high communication complexity because they introduce a large amount of extra transmission; (2) they do not adapt well to non-power-of-two numbers of workers. To solve these two problems, we propose a flexible and efficient sparse communication framework, dubbed SparDL. SparDL uses the Spar-Reduce-Scatter algorithm to solve the GA dilemma without additional communication operations, and it works with any number of workers. In addition, to further reduce the communication complexity and to adjust the balance between latency cost and bandwidth cost within it, we propose the Spar-All-Gather algorithm as part of SparDL. Extensive experiments validate the superiority of SparDL.
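To make the setting concrete, the following is a minimal sketch of plain top-$k$ sparsification and of what happens when sparse gradients from several workers are summed. It is not the Spar-Reduce-Scatter or Spar-All-Gather algorithm. The helper names (`top_k_sparsify`, `sparse_sum`) and the toy gradients are hypothetical. The abstract does not define the GA dilemma, so the sketch assumes one reading of it: different workers select different indices, and the summed result can hold up to (number of workers) × $k$ non-zero entries rather than $k$.

```python
import heapq


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a dense gradient.

    Returns a dict {index: value}; every other entry is treated as zero.
    (Hypothetical helper for illustration only.)
    """
    if k <= 0:
        return {}
    # Select the k indices whose entries have the largest absolute value.
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in idx}


def sparse_sum(sparse_grads):
    """Index-wise sum of sparse gradients coming from several workers."""
    total = {}
    for sg in sparse_grads:
        for i, v in sg.items():
            total[i] = total.get(i, 0.0) + v
    return total


# Toy gradients from three workers, eight parameters each (made-up values).
worker_grads = [
    [0.9, -0.1, 0.0, 0.2, -0.8, 0.05, 0.0, 0.1],
    [0.1, 0.7, -0.6, 0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.1, 0.0, -0.9, 0.1, 0.0, 0.0, 0.75],
]
k = 2

sparse = [top_k_sparsify(g, k) for g in worker_grads]
summed = sparse_sum(sparse)

print([len(s) for s in sparse])  # each worker sends k = 2 entries: [2, 2, 2]
print(len(summed))               # the sum holds 6 non-zero entries, not 2
```

In this toy case the three workers pick disjoint index sets, so the sum is three times denser than any single worker's contribution. Keeping the communicated volume bounded would require sparsifying again at some point during the reduction, which is the kind of problem the abstract says SparDL addresses.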