Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL simplifies the description and implementation of different schemes by building on OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL combines an analytical performance model with a customized Coordinate Descent algorithm, which significantly reduces the overhead of searching for a scheme. We evaluate AutoDDL on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and by up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
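To make the search idea concrete, the following is a minimal sketch of generic coordinate descent over a discrete space of per-layer parallelization choices, scored by an analytical cost function. It is not AutoDDL's customized algorithm or its performance model, neither of which is specified in the abstract. The function `coordinate_descent`, the `toy_cost` model, the `LAYER_COST` table, and the `SWITCH_COST` constant are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence


def coordinate_descent(
    num_coords: int,
    choices: Sequence[int],
    cost: Callable[[List[int]], float],
    max_sweeps: int = 50,
) -> List[int]:
    """Minimize `cost` over a discrete grid, changing one coordinate at a time."""
    current = [choices[0]] * num_coords
    best = cost(current)
    for _ in range(max_sweeps):
        improved = False
        for i in range(num_coords):
            # Try every alternative for coordinate i while the others stay fixed.
            for c in choices:
                if c == current[i]:
                    continue
                trial = current[:i] + [c] + current[i + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    current, best = trial, trial_cost
                    improved = True
        if not improved:
            break  # a full sweep found no better single-coordinate move
    return current


# Hypothetical analytical model: each layer picks scheme 0 or scheme 1.
# LAYER_COST[i][s] is the assumed cost of layer i under scheme s, and
# SWITCH_COST is an assumed penalty when adjacent layers use different schemes.
LAYER_COST = [(4.0, 1.0), (1.0, 3.0), (2.0, 2.5)]
SWITCH_COST = 0.5


def toy_cost(assignment: List[int]) -> float:
    per_layer = sum(LAYER_COST[i][s] for i, s in enumerate(assignment))
    switches = sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)
    return per_layer + SWITCH_COST * switches


best_assignment = coordinate_descent(len(LAYER_COST), [0, 1], toy_cost)
print(best_assignment, toy_cost(best_assignment))
```

Because each candidate is scored by a closed-form model rather than by a training run, a sweep of this kind is cheap. Plain coordinate descent can stop at a local minimum, though, which is one reason a customized variant may be needed in practice.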