As models and datasets grow, parallel training has become increasingly common. However, existing distributed stochastic gradient descent (SGD) algorithms underutilize computational resources and converge poorly in heterogeneous clusters. In this paper, we propose a delayed synchronous SGD algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters. In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated gradients to update the parameters. While the global synchronization of delayed gradients is in progress, each worker computes its next batch without a batch size specified in advance; this computation continues until the next global synchronization starts, so computational resources are fully utilized. Because the gradient delay is only one iteration, the stale gradient problem is alleviated. We theoretically prove the convergence of ABS-SGD in heterogeneous clusters. Extensive experiments on three types of heterogeneous clusters demonstrate that ABS-SGD makes full use of computational resources and accelerates model convergence: when training a ResNet18 network with 4 workers, ABS-SGD increases convergence speed by 1.30x on average compared with the best baseline algorithm.
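The update described in the abstract can be pictured with a single-process Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The function name `abs_sgd_sketch`, the `speeds` and `window` parameters, the least-squares objective, and the weighting of the accumulated gradient by total sample count are all hypothetical choices made for the example. Each simulated worker processes as many samples as its speed allows during one synchronization window, and the parameters are updated with the gradients accumulated in the previous window, so the gradient delay is one iteration.

```python
import numpy as np

def abs_sgd_sketch(speeds, steps=200, lr=0.1, window=1.0, seed=0):
    """Toy simulation of delayed synchronous SGD with adaptive batch sizes.

    speeds: hypothetical samples-per-unit-time for each worker (assumption).
    window: hypothetical duration of one global synchronization (assumption).
    """
    rng = np.random.default_rng(seed)
    dim = 5
    w_true = rng.normal(size=dim)   # target of a noiseless least-squares problem
    w = np.zeros(dim)
    pending = None                  # (gradient sum, sample count) from the previous window
    batch_sizes = []

    for _ in range(steps):
        grad_sum = np.zeros(dim)
        total = 0
        sizes = []
        for s in speeds:
            # Batch size is not fixed in advance: it is whatever the worker
            # manages to process before the next synchronization starts.
            b = max(1, int(s * window))
            X = rng.normal(size=(b, dim))
            y = X @ w_true
            # Sum of per-sample gradients of 0.5 * (x @ w - y)**2 at the current w.
            grad_sum += X.T @ (X @ w - y)
            total += b
            sizes.append(b)
        batch_sizes.append(sizes)

        # The gradients from the previous window have finished synchronizing
        # in the meantime; apply them now (delay of one iteration).
        if pending is not None:
            g, n = pending
            w = w - lr * g / n
        pending = (grad_sum, total)

    return w, w_true, batch_sizes
```

Calling `abs_sgd_sketch([8, 16, 32])` simulates three workers of unequal speed. The faster workers contribute larger batches to each accumulated gradient, and no worker sits idle waiting for the slowest one.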