Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning. However, most existing gradient sparsifiers scale poorly because of the considerable computational cost of gradient selection, the increased communication traffic caused by gradient build-up, or both. To address these challenges, we propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers. Unlike existing sparsifiers, in which every worker selects from among all gradients, each DEFT worker selects only within its own partition. Consequently, the per-worker computational cost of selection decreases as the number of workers increases. Moreover, gradient build-up is eliminated because the partitions assigned to different workers do not intersect. Therefore, even as the number of workers increases, the communication traffic stays at the level the user specifies. To preserve the significance of the selected gradients, DEFT selects more gradients in layers that have a larger gradient norm than in the other layers. Because every layer has a different computational load, DEFT allocates layers to workers using a bin-packing algorithm to keep the gradient selection load balanced across workers. In our empirical evaluation, DEFT selects gradients significantly faster than existing sparsifiers, which improves training speed, while achieving high convergence performance.
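The three ideas described above (norm-aware selection budgets, bin-packed layer assignment, and selection restricted to disjoint partitions) can be illustrated with a minimal pure-Python sketch. This is not DEFT's implementation: the function names `allocate_k`, `assign_layers`, and `select_for_worker` are hypothetical, the budget is assumed to be split in proportion to each layer's L2 norm, the layer size is assumed as the cost proxy, and a simple greedy heuristic stands in for the bin-packing algorithm.

```python
import heapq
import math


def allocate_k(layer_grads, total_k):
    """Give each layer a share of total_k proportional to its L2 norm.

    Proportional splitting is an assumption; rounding means the shares
    may not sum to total_k exactly.
    """
    norms = [math.sqrt(sum(g * g for g in grads)) for grads in layer_grads]
    norm_sum = sum(norms) or 1.0  # avoid division by zero if all gradients are zero
    return [
        min(len(grads), round(total_k * norm / norm_sum))
        for grads, norm in zip(layer_grads, norms)
    ]


def assign_layers(costs, num_workers):
    """Greedy packing: put the costliest remaining layer on the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(num_workers)]
    for layer in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(layer)
        heapq.heappush(heap, (load + costs[layer], worker))
    return assignment


def select_for_worker(layer_grads, my_layers, ks):
    """Top-k by magnitude, restricted to the layers assigned to this worker."""
    selected = []
    for layer in my_layers:
        grads = layer_grads[layer]
        top = heapq.nlargest(ks[layer], range(len(grads)), key=lambda i: abs(grads[i]))
        selected.extend((layer, i, grads[i]) for i in top)
    return selected


# Toy example: three layers, two workers, a total budget of four gradients.
layer_grads = [
    [0.9, -0.1, 0.05, 0.7],
    [0.01, 0.02, -0.03],
    [2.0, -1.5, 0.1, 0.2, 0.3, -0.4],
]
ks = allocate_k(layer_grads, total_k=4)
costs = [len(grads) for grads in layer_grads]  # assumed cost proxy: layer size
assignment = assign_layers(costs, num_workers=2)
per_worker = [select_for_worker(layer_grads, layers, ks) for layers in assignment]
```

Because each layer belongs to exactly one worker, the index sets in `per_worker` cannot overlap, so the combined number of selected gradients is bounded by the budget regardless of how many workers participate. Each worker also scans only its own layers, which is why the selection cost per worker shrinks as workers are added.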