PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule incurs little bubble overhead and consistently delivers high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness, because different mini-batches are trained in an interleaved fashion across GPUs. To address both problems simultaneously, we propose PipeOptim, an optimizer-dependent weight prediction strategy for asynchronous pipeline training. The key insight is to apply weight prediction ahead of the forward pass so that each mini-batch computes its forward pass with consistent and staleness-free weights. Concretely, we first construct the weight prediction scheme from the update rule of the optimizer used to train the deep neural network. Then, throughout "1F1B" pipelined training, each mini-batch performs weight prediction before its forward pass and uses the predicted weights for that pass. As a result, PipeOptim 1) inherits the high throughput of the "1F1B" schedule, and 2) ensures effective parameter learning regardless of the optimizer used. To verify the effectiveness of our proposal, we conducted extensive experiments on eight deep-learning models spanning three machine-learning tasks: image classification, sentiment analysis, and machine translation. The experimental results demonstrate that PipeOptim outperforms popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim is available at https://github.com/guanleics/PipeOptim.
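
The abstract does not spell out the prediction formula, so the following is only a minimal sketch of what "optimizer-dependent weight prediction" could look like. It assumes that the predicted weights are obtained by extrapolating the optimizer's most recent update direction over `s` future steps, where `s` is a hypothetical parameter for the number of updates that will be applied before the backward pass of the current mini-batch. The function names, the parameter `s`, and the constant-direction assumption are illustrative and are not taken from the PipeOptim code base.

```python
import math


def predict_sgd_momentum(weights, velocity, lr, s):
    """Predict weights s steps ahead for SGD with momentum.

    Assumption (not from the abstract): the momentum buffer stays roughly
    constant over the next s steps, so each step moves by lr * velocity.
    """
    return [w - lr * s * v for w, v in zip(weights, velocity)]


def predict_adam(weights, m, v, lr, s, step,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Predict weights s steps ahead for Adam.

    Uses the standard bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps)
    and, as an assumption, holds it fixed for the next s steps.
    `step` is the number of optimizer updates applied so far (>= 1).
    """
    bc1 = 1.0 - beta1 ** step  # bias correction for the first moment
    bc2 = 1.0 - beta2 ** step  # bias correction for the second moment
    predicted = []
    for w, mi, vi in zip(weights, m, v):
        direction = (mi / bc1) / (math.sqrt(vi / bc2) + eps)
        predicted.append(w - lr * s * direction)
    return predicted


def forward_with_prediction(weights, predict_fn, forward_fn, batch):
    """Run the forward pass on predicted weights; the stored weights are untouched."""
    predicted = predict_fn(weights)
    return forward_fn(predicted, batch)


if __name__ == "__main__":
    w = [1.0, 2.0]
    vel = [0.5, -1.0]
    # Toy "model": a dot product between the weights and the input.
    dot = lambda ws, xs: sum(a * b for a, b in zip(ws, xs))
    out = forward_with_prediction(
        w, lambda ws: predict_sgd_momentum(ws, vel, lr=0.1, s=2), dot, [1.0, 1.0]
    )
    print(out)
```

The point of the sketch is that the prediction rule changes with the optimizer (momentum buffer for SGD, moment estimates for Adam), while the pipeline step itself only needs a `predict_fn` to call before each forward pass.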
