As the computational demand of DNN training workloads keeps growing, distributed training has been widely adopted. Hybrid-parallel distributed training, which combines data, model, and pipeline parallelism, has been introduced to tackle the problem of deploying large-scale models. However, evaluating a hybrid strategy and the utilization of each device remains a challenge: existing works either profile on a real large-scale cluster, at high time and monetary cost, or analyze only a specific type of parallelism without considering hybrid parallelism. In this work, we propose DistSim, an event-based performance model that accurately analyzes each device's computation and communication activities at low profiling cost. DistSim breaks the model down into events according to the given distributed strategy; these events can be profiled on two nodes. DistSim then leverages the hierarchy of the different parallel strategies to generate the computation and communication event flow from the layer level to the model level, and finally the activity timeline of each device participating in training. Experiments show that DistSim achieves $<4\%$ error when predicting distributed training batch time and $<5\%$ error when predicting a single device's activity time across various hybrid strategy settings. We also provide a use case of DistSim, automatically evaluating and searching for the best distributed training strategy, which finds a hybrid strategy with up to $7.37\times$ throughput improvement.
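To make the event-based idea concrete, the following is a minimal sketch of replaying profiled events into per-device timelines. It is not DistSim's implementation: the `Event` type, `simulate`, `build_pipeline`, and `busy_time` are hypothetical names, the durations are made-up stand-ins for profiled costs, and the sketch assumes a forward-only pipeline with one stage per device and blocking sends that occupy the sending device.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    device: int
    duration: float  # stand-in for a profiled cost, in ms (made-up values)
    kind: str        # "comp" or "comm"
    deps: list = field(default_factory=list)  # names of events that must finish first


def simulate(events):
    """Replay events in list order and return {name: (start, end)}.

    Assumes the list is topologically ordered and that each device
    runs its events one at a time, in list order.
    """
    device_free = {}  # device id -> time at which the device becomes idle
    span = {}
    for ev in events:
        ready = max((span[d][1] for d in ev.deps), default=0.0)
        start = max(ready, device_free.get(ev.device, 0.0))
        end = start + ev.duration
        span[ev.name] = (start, end)
        device_free[ev.device] = end
    return span


def build_pipeline(num_stages, num_microbatches, fwd_ms, send_ms):
    """Generate events for a forward-only pipeline; stage s runs on device s."""
    events = []
    for m in range(num_microbatches):
        for s in range(num_stages):
            deps = [f"send_s{s - 1}_m{m}"] if s > 0 else []
            events.append(Event(f"fwd_s{s}_m{m}", s, fwd_ms, "comp", deps))
            if s < num_stages - 1:
                # Assumption: the send blocks the sending device.
                events.append(
                    Event(f"send_s{s}_m{m}", s, send_ms, "comm", [f"fwd_s{s}_m{m}"])
                )
    return events


def busy_time(events, device):
    """Total time the device spends on computation or communication events."""
    return sum(ev.duration for ev in events if ev.device == device)


events = build_pipeline(num_stages=2, num_microbatches=2, fwd_ms=10.0, send_ms=2.0)
span = simulate(events)
batch_time = max(end for _, end in span.values())
print(f"predicted batch time: {batch_time} ms")
for dev in (0, 1):
    print(f"device {dev} utilization: {busy_time(events, dev) / batch_time:.2f}")
```

In this toy setting the second device idles until the first activation arrives, so its utilization falls below that of the first device. A fuller model along the lines the abstract describes would also need backward passes, data- and model-parallel collectives, and event costs obtained from real profiling.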