Distributed data-parallel (DDP) training improves overall application throughput by having multiple devices each train on a subset of the data and aggregate their updates into a globally shared model. Synchronization at every iteration incurs considerable overhead, which grows with the size and complexity of state-of-the-art neural networks. Although many gradient compression techniques have been proposed to reduce communication cost, the ideal compression factor, the one that yields maximum speedup or minimum data exchange, remains an open problem because it varies with the quality of compression, model size and structure, hardware, and network topology and bandwidth. We propose GraVAC, a framework that dynamically adjusts the compression factor throughout training by evaluating model progress and assessing the gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. Compared with using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x, respectively. Compared with other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.
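To make the idea of trading compression factor against gradient information loss concrete, the following is a minimal sketch and not GraVAC's actual algorithm. It assumes Top-k sparsification as the compressor and uses the fraction of squared gradient norm that survives compression as a stand-in for information loss; the function names, the candidate list, and the `min_retained` threshold are all hypothetical choices made for illustration.

```python
def top_k_sparsify(grad, compression_factor):
    """Keep the len(grad)/compression_factor largest-magnitude entries; zero the rest."""
    k = max(1, int(len(grad) / compression_factor))
    # Indices of the k largest-magnitude entries (ties resolved by position).
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def retained_energy(grad, compressed):
    """Fraction of the squared L2 norm that survives compression (1.0 = lossless)."""
    total = sum(g * g for g in grad)
    if total == 0.0:
        return 1.0  # an all-zero gradient loses nothing under compression
    return sum(c * c for c in compressed) / total


def pick_compression_factor(grad, candidates, min_retained):
    """Return the largest candidate whose retained energy is >= min_retained.

    Falls back to 1.0 (no compression) if no candidate qualifies.
    The threshold is an assumed tuning knob, not a value from the paper.
    """
    best = 1.0
    for cf in sorted(candidates):
        if retained_energy(grad, top_k_sparsify(grad, cf)) >= min_retained:
            best = cf
    return best


if __name__ == "__main__":
    grad = [4.0, -3.0, 0.1, 0.2, -0.1, 0.05, 0.0, 0.1]
    print(pick_compression_factor(grad, [2, 4, 8], min_retained=0.9))
```

In this toy gradient most of the energy sits in two entries, so a fairly aggressive factor still retains nearly all of the norm; a flatter gradient would push the choice toward less compression. An online scheme would re-run such a check periodically during training rather than once.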