Efficient distributed training is a principal driver of recent advances in deep learning. However, communication is often costly and becomes the primary bottleneck in these systems. This creates demand for efficient communication mechanisms that boost throughput in practice while retaining theoretical guarantees. In this work, we introduce Global-QSGD, a novel family of quantization operators that accelerates distributed training through global scaling. We demonstrate that Global-QSGD is the first theoretically rigorous Allreduce-compatible compression mechanism that achieves a provable speed-up by balancing compression error against communication savings. Importantly, Global-QSGD is inherently unbiased and therefore does not rely on costly error feedback, and it offers up to $O(\sqrt{n})$ additional compression ratio compared to the popular QSGD quantization, where $n$ is the number of workers. To obtain theoretical guarantees, we generalize the notion of standard unbiased compression operators to incorporate Global-QSGD. We show that this wider class permits the standard analysis for unbiased compressors and thus ensures convergence of popular optimization algorithms (e.g., distributed SGD) under typical settings. On the empirical side, we carry out a performance modeling analysis to determine whether Global-QSGD can enhance training throughput under specific hardware configurations. We also conduct extensive empirical evaluations on various tasks, testing our theory on both NVLink and PCIe connections as well as on a large-scale cloud system.
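To make the idea of globally scaled, unbiased quantization concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the operator, so the sketch assumes QSGD-style stochastic rounding onto a uniform grid and assumes the shared scale is the largest L2 norm across workers. The names `global_scale`, `quantize`, and `allreduce_mean` are hypothetical. Because every worker uses the same scale, the integer codes share units and can be summed directly, which is what an Allreduce needs. With a per-worker scale, each message would have to be decompressed before aggregation.

```python
import math
import random


def global_scale(worker_grads):
    # Assumed choice of shared scale: the largest L2 norm over all workers.
    # In a real system this would be a single scalar max-reduction.
    return max(math.sqrt(sum(g * g for g in grad)) for grad in worker_grads)


def quantize(grad, scale, levels, rng):
    # Map each coordinate to a signed integer code in [-levels, levels].
    if scale == 0.0:
        return [0] * len(grad)
    codes = []
    for g in grad:
        x = abs(g) / scale * levels  # position on the grid, in [0, levels]
        low = math.floor(x)
        # Round up with probability equal to the fractional part,
        # so the expected magnitude of the code equals x (unbiased).
        q = low + (1 if rng.random() < x - low else 0)
        codes.append(q if g >= 0 else -q)
    return codes


def allreduce_mean(worker_grads, levels, rng):
    scale = global_scale(worker_grads)
    codes = [quantize(g, scale, levels, rng) for g in worker_grads]
    # The element-wise integer sum stands in for the Allreduce step.
    summed = [sum(col) for col in zip(*codes)]
    n = len(worker_grads)
    # Decode once, after aggregation.
    return [s * scale / (levels * n) for s in summed]


# Two simulated workers with three-dimensional gradients (illustrative values).
rng = random.Random(0)
grads = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
estimate = allreduce_mean(grads, levels=4, rng=rng)
```

A single call returns a noisy estimate of the mean gradient. Averaging many independent calls approaches the exact mean, which is the unbiasedness property that removes the need for error feedback.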