As deep neural networks (DNNs) grow in size and complexity, the communication overhead of distributed training has become a significant bottleneck that limits the scalability of distributed training systems. Existing solutions try to mitigate this bottleneck through worker-level compression and in-network aggregation, but they fail to reconcile the trade-off between compression effectiveness and computational overhead, which hinders overall performance and scalability. In this paper, we introduce a novel compression algorithm that combines worker-level compression with in-network aggregation. Our solution is homomorphic, so compressed data can be aggregated in the network without CPU/GPU processing, and lossless, so training accuracy is not compromised. The approach is theoretically optimal in both compression and computational efficiency. We validate it empirically on diverse DNN models, including NCF, LSTM, VGG19, and BERT-base, and observe up to a 6.33$\times$ improvement in aggregation throughput and up to a 3.74$\times$ increase in per-iteration training speed.
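The two properties named above can be stated compactly. As a notational sketch only (the symbols $C$, $D$, $\oplus$, $g_i$, and $n$ are introduced here for illustration and are not taken from the paper), let $g_i$ be the gradient of worker $i$, $C$ the worker-side compressor, $D$ the decompressor, and $\oplus$ the operation the network device applies to compressed payloads:

$$
\underbrace{C(g_1) \oplus \cdots \oplus C(g_n) = C\Big(\sum_{i=1}^{n} g_i\Big)}_{\text{homomorphic}}, \qquad \underbrace{D\big(C(g)\big) = g}_{\text{lossless}} .
$$

Taken together, these would give $D\big(C(g_1) \oplus \cdots \oplus C(g_n)\big) = \sum_{i=1}^{n} g_i$, that is, workers recover the exact aggregate even though the network only ever handles compressed data.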