Communication bottlenecks limit the scalability of distributed neural network training, particularly on distributed-memory computing clusters. To reduce this communication overhead, we introduce AB-training, a novel data-parallel training method that decomposes weight matrices into low-rank representations and trains them in independent groups. This approach consistently reduces network traffic by 50% across multiple scaling scenarios, making training more practical on communication-constrained systems. At smaller scales, the method has a regularizing effect that improves generalization for models such as VGG16, and it achieves a 44.14:1 compression ratio during training on CIFAR-10 while maintaining competitive accuracy. Although these results are promising, our experiments show that large-batch effects remain a challenge even in low-rank training regimes.
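To make the link between low-rank decomposition and smaller model size concrete, the following minimal sketch counts parameters for one dense weight matrix and for a rank-r factorization of it. It assumes a factorization of the form W ≈ A·B with A of shape m×r and B of shape r×n; the function names and the layer shape are hypothetical, chosen only for illustration, and the resulting ratio is not a figure reported in this work.

```python
# Illustrative sketch only: parameter counts for a dense weight matrix
# versus an assumed rank-r factorization W ~= A @ B.
# Function names and the example shape are hypothetical.

def dense_params(m: int, n: int) -> int:
    """Number of entries in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of entries in A (m x r) plus B (r x n)."""
    return r * (m + n)


def compression_ratio(m: int, n: int, r: int) -> float:
    """Dense parameter count divided by the factorized parameter count."""
    return dense_params(m, n) / low_rank_params(m, n, r)


if __name__ == "__main__":
    m, n, r = 4096, 4096, 64  # hypothetical layer shape and rank
    print(dense_params(m, n))          # 16777216
    print(low_rank_params(m, n, r))    # 524288
    print(compression_ratio(m, n, r))  # 32.0
```

The factorized form is smaller whenever r < m·n / (m + n). How much of that saving carries over to network traffic depends on what is exchanged between workers and how often, which this sketch does not model.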