Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning

Communication overhead is a major obstacle to scaling distributed training systems. Gradient sparsification can reduce the communication volume without significant loss of model fidelity. However, existing gradient sparsification methods scale poorly because of inefficient algorithm design, which significantly raises the communication overhead. In particular, gradient build-up and inadequate sparsity control degrade sparsification performance considerably. Moreover, communication traffic increases drastically owing to workload imbalance of gradient selection between workers. To address these challenges, we propose a novel gradient sparsification scheme called ExDyna. In ExDyna, the gradient tensor of the model comprises fine-grained blocks, and contiguous blocks are grouped into non-overlapping partitions. Each worker selects gradients only in its exclusively allocated partition, so gradient build-up never occurs. To balance the workload of gradient selection between workers, ExDyna adjusts the topology of partitions by comparing the workloads of adjacent partitions. In addition, ExDyna supports online threshold scaling, which estimates an accurate gradient-selection threshold on the fly. Accordingly, ExDyna can satisfy the user-required sparsity level throughout training regardless of the model and dataset. ExDyna thereby enhances the scalability of distributed training systems by preserving near-optimal gradient sparsification cost. In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance while achieving high accuracy.
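The three mechanisms named above (exclusive block partitions, workload comparison between adjacent partitions, and online threshold scaling) can be illustrated with a minimal Python sketch. It is not the ExDyna implementation: the function names, the one-block boundary shift, the `tolerance` parameter, and the multiplicative `factor` update are all assumptions made for illustration, since the abstract does not specify the actual rules.

```python
def partition_bounds(num_blocks, num_workers):
    """Split block indices [0, num_blocks) into contiguous, non-overlapping partitions."""
    base, extra = divmod(num_blocks, num_workers)
    bounds, start = [], 0
    for w in range(num_workers):
        end = start + base + (1 if w < extra else 0)
        bounds.append((start, end))  # half-open block range owned by worker w
        start = end
    return bounds


def select_in_partition(grad, block_size, bounds, worker, threshold):
    """Return indices inside this worker's partition whose magnitude meets the threshold."""
    lo = bounds[worker][0] * block_size
    hi = min(bounds[worker][1] * block_size, len(grad))
    # Partitions are disjoint, so no two workers can ever select the same index.
    return [i for i in range(lo, hi) if abs(grad[i]) >= threshold]


def rebalance(bounds, workloads, tolerance=1.5):
    """Hypothetical rule: move each shared boundary by one block so the busier
    of two adjacent partitions shrinks. Partitions stay contiguous and disjoint."""
    cuts = [b[0] for b in bounds] + [bounds[-1][1]]  # partition edges
    for w in range(len(bounds) - 1):
        left, right = workloads[w], workloads[w + 1]
        if left > tolerance * right and cuts[w + 1] - cuts[w] > 1:
            cuts[w + 1] -= 1  # left partition hands its last block to the right
        elif right > tolerance * left and cuts[w + 2] - cuts[w + 1] > 1:
            cuts[w + 1] += 1  # right partition hands its first block to the left
    return [(cuts[w], cuts[w + 1]) for w in range(len(bounds))]


def scale_threshold(threshold, num_selected, target, factor=1.1):
    """Hypothetical multiplicative update: raise the threshold when too many
    gradients were selected, lower it when too few were selected."""
    if num_selected > target:
        return threshold * factor
    if num_selected < target:
        return threshold / factor
    return threshold
```

In a training loop, each worker would call `select_in_partition` on its local gradient once per iteration, then feed the total selected count into `scale_threshold` and the per-worker counts into `rebalance` before the next iteration. Because the selected index sets are disjoint by construction, their union has exactly as many entries as the sum of the per-worker counts, which is the property that rules out gradient build-up.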
