AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the overhead of searching for a scheme. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared with expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and by up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
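
To make the search idea concrete, the following is a minimal sketch of coordinate descent over per-layer parallelization choices, driven by an analytical cost function. It is not AutoDDL's actual algorithm or performance model: the strategy labels `"data"` and `"model"`, the `LAYER_COST` table, `SWITCH_COST`, and the function names are all hypothetical stand-ins, and the real system searches over SBP signatures rather than two labels.

```python
from typing import Callable, List, Sequence

# Hypothetical per-layer communication volumes (arbitrary units).
# Early layers are assumed cheap under data parallelism, later ones
# under model (operator) parallelism.
LAYER_COST = [
    {"data": 1.0, "model": 4.0},
    {"data": 1.0, "model": 4.0},
    {"data": 8.0, "model": 2.0},
    {"data": 8.0, "model": 2.0},
]
# Hypothetical cost of redistributing tensors when adjacent layers
# use different strategies.
SWITCH_COST = 1.5


def toy_cost(plan: List[str]) -> float:
    """Toy analytical model: per-layer cost plus a penalty per strategy switch."""
    total = sum(LAYER_COST[i][s] for i, s in enumerate(plan))
    total += SWITCH_COST * sum(a != b for a, b in zip(plan, plan[1:]))
    return total


def coordinate_descent(
    n_layers: int,
    choices: Sequence[str],
    cost: Callable[[List[str]], float],
    max_sweeps: int = 20,
) -> List[str]:
    """Change one layer's choice at a time, keeping any change that lowers cost."""
    plan = [choices[0]] * n_layers
    best = cost(plan)
    for _ in range(max_sweeps):
        improved = False
        for i in range(n_layers):
            for c in choices:
                if c == plan[i]:
                    continue
                candidate = plan[:i] + [c] + plan[i + 1:]
                candidate_cost = cost(candidate)
                if candidate_cost < best:
                    plan, best = candidate, candidate_cost
                    improved = True
        if not improved:
            break  # a full sweep found no better single-layer change
    return plan


plan = coordinate_descent(len(LAYER_COST), ["data", "model"], toy_cost)
print(plan, toy_cost(plan))
```

On this toy input the search keeps data parallelism for the first two layers and switches the last two to model parallelism. Each sweep evaluates only one cost per layer and alternative choice instead of enumerating every combination, which is the property that makes an analytical model plus coordinate descent cheap to run; the trade-off is that it can stop at a local minimum.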

