Multi-task model training has been adopted to enable a single deep neural network model (often a large language model) to handle multiple tasks (e.g., question answering and text summarization). Multi-task training commonly receives input sequences of highly different lengths, because the contexts of different tasks are diverse. Input samples are usually prepared by padding (extending every sequence to the same length) or packing (concatenating short examples into long sequences of the same length), but neither is space- or computation-efficient. This paper proposes DynaPipe, a dynamic micro-batching approach that tackles sequence length variation and enables efficient multi-task model training. We advocate pipeline-parallel training of the large model with variable-length micro-batches, each of which may comprise a different number of samples. We optimize micro-batch construction using a dynamic programming-based approach, and handle variation in micro-batch execution time through dynamic pipeline and communication scheduling, enabling highly efficient pipeline training. Extensive evaluation on the FLANv2 dataset demonstrates up to 4.39x higher training throughput when training T5, and up to 3.25x when training GPT, compared with packing-based baselines. DynaPipe's source code is publicly available at https://github.com/awslabs/optimizing-multitask-training-through-dynamic-pipelines.
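To make the idea of dynamic-programming-based micro-batch construction concrete, the following is a minimal sketch and not DynaPipe's actual algorithm. It assumes a simplified cost model in which a micro-batch costs its padded token count (number of samples times the longest sample) plus a fixed per-micro-batch overhead. The names `padded_tokens`, `split_into_micro_batches`, `max_tokens`, and `per_batch_overhead` are hypothetical and chosen only for illustration.

```python
from typing import List


def padded_tokens(lengths: List[int]) -> int:
    """Tokens processed when every sample is padded to the longest one."""
    return len(lengths) * max(lengths) if lengths else 0


def split_into_micro_batches(
    lengths: List[int], max_tokens: int, per_batch_overhead: int = 0
) -> List[List[int]]:
    """Partition samples into variable-size micro-batches with a simple DP.

    Assumed cost model (illustrative only): each micro-batch costs its
    padded token count plus a fixed overhead. Samples are sorted by length
    so that every micro-batch is a contiguous run of similar lengths.
    """
    ordered = sorted(lengths)
    n = len(ordered)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimum cost to batch the first i samples
    cut = [0] * (n + 1)     # cut[i]: start index of the last batch in that optimum
    best[0] = 0

    for end in range(1, n + 1):
        longest = ordered[end - 1]  # longest sample in any batch ending here
        for start in range(end - 1, -1, -1):
            cost = (end - start) * longest
            if cost > max_tokens:
                break  # growing the batch further only increases the cost
            total = best[start] + cost + per_batch_overhead
            if total < best[end]:
                best[end] = total
                cut[end] = start

    if best[n] == inf:
        raise ValueError("a single sample exceeds max_tokens")

    # Walk the recorded cut points backwards to recover the micro-batches.
    batches: List[List[int]] = []
    end = n
    while end > 0:
        start = cut[end]
        batches.append(ordered[start:end])
        end = start
    batches.reverse()
    return batches


if __name__ == "__main__":
    sample_lengths = [10, 12, 11, 500, 480, 60]
    micro_batches = split_into_micro_batches(
        sample_lengths, max_tokens=1024, per_batch_overhead=50
    )
    print("micro-batches:", micro_batches)
    print("padded tokens, single batch:", padded_tokens(sample_lengths))
    print("padded tokens, dynamic:", sum(padded_tokens(b) for b in micro_batches))
```

Under these assumptions, the six sample lengths above are grouped into three micro-batches of sizes 3, 1, and 2, which process 1,096 padded tokens in total, whereas padding all six samples to the longest one would process 3,000. The fixed overhead matters: without it, the cheapest partition would trivially place every sample in its own micro-batch. A real system would replace this proxy with a measured execution-time and memory model.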