Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is time-consuming because of the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory-intensive and communication-intensive. These characteristics make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who need to manually modify the model to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys models automatically using a model partitioner, which applies a graph sharding algorithm to a proxy representation of the model. Merak also provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources: a shifted critical path pipeline schedule that yields higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak speeds up training over state-of-the-art 3D parallelism frameworks by up to 1.42X, 1.39X, 1.43X, and 1.61X for models with 1.5, 2.5, 8.3, and 20 billion parameters, respectively.
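To make the notion of 3D parallelism concrete, the following is a minimal sketch of how a flat set of GPU ranks can be arranged into a data × pipeline × tensor grid. It is not Merak's implementation. The names `rank_to_coord` and `build_groups`, the tensor-fastest rank ordering, and the 4 × 4 × 4 split of 64 GPUs are all assumptions chosen for illustration; the abstract does not state which split the experiments used.

```python
from typing import Dict, List, NamedTuple


class Coord(NamedTuple):
    data: int    # index of the data-parallel replica
    pipe: int    # index of the pipeline stage
    tensor: int  # index of the tensor-parallel shard


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a flat rank to grid coordinates (hypothetical layout).

    Assumes the tensor index varies fastest, then the pipeline index,
    then the data index.
    """
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for the given grid")
    return Coord(
        data=rank // (tp * pp),
        pipe=(rank // tp) % pp,
        tensor=rank % tp,
    )


def build_groups(dp: int, pp: int, tp: int) -> Dict[str, List[List[int]]]:
    """Collect the ranks that communicate along each parallel axis."""
    tensor_groups: Dict[tuple, List[int]] = {}
    pipe_groups: Dict[tuple, List[int]] = {}
    data_groups: Dict[tuple, List[int]] = {}
    for rank in range(dp * pp * tp):
        c = rank_to_coord(rank, dp, pp, tp)
        # Ranks sharing every coordinate except one form a group on that axis.
        tensor_groups.setdefault((c.data, c.pipe), []).append(rank)
        pipe_groups.setdefault((c.data, c.tensor), []).append(rank)
        data_groups.setdefault((c.pipe, c.tensor), []).append(rank)
    return {
        "tensor": list(tensor_groups.values()),
        "pipe": list(pipe_groups.values()),
        "data": list(data_groups.values()),
    }


if __name__ == "__main__":
    # Assumed split: 4 data replicas x 4 pipeline stages x 4 tensor shards.
    groups = build_groups(dp=4, pp=4, tp=4)
    print(rank_to_coord(5, 4, 4, 4))
    print(groups["tensor"][0], groups["pipe"][0], groups["data"][0])
```

Placing tensor-parallel peers on adjacent ranks is a common choice because tensor parallelism exchanges data most frequently, so it benefits from the fastest interconnect; other orderings are equally valid under different hardware topologies.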