Transformer models have become the leading approach for achieving state-of-the-art performance across many application domains and serve as the foundation for large-scale deep learning (DL) models. However, training these models efficiently across multiple GPUs remains challenging because of the large number of parallelism options. Existing DL systems either require manual effort to design distributed training plans or restrict parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To navigate this vast search space effectively, we use a decision tree approach to decompose and prune it based on intuitive insights. We then apply a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate that Galvatron-BMW can automate distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves higher system throughput than previous approaches that rely on limited parallelism strategies.
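To make the idea of a dynamic programming search under a GPU memory constraint concrete, the following is a minimal sketch and not the algorithm used by Galvatron-BMW. It assumes that each layer independently picks one parallelism strategy, that each strategy has a known time cost and an integer memory cost, and that costs simply add up across layers. The function name `search_plan`, the strategy names, and all cost numbers are hypothetical and chosen only for illustration; a real system would also have to account for effects this sketch ignores, such as the cost of switching strategies between adjacent layers.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical candidate: (strategy name, time cost, memory cost in integer units).
Strategy = Tuple[str, float, int]


def search_plan(
    layers: List[List[Strategy]], mem_budget: int
) -> Optional[Tuple[float, List[str]]]:
    """Choose one strategy per layer, minimizing total time within mem_budget.

    Returns (total_time, plan), or None if no combination fits the budget.
    """
    # best[m] = cheapest (total_time, plan) that uses exactly m memory units so far.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time_so_far, plan) in best.items():
            for name, dt, dm in candidates:
                m = used + dm
                if m > mem_budget:
                    continue  # this choice would exceed the memory budget
                cand = (time_so_far + dt, plan + [name])
                if m not in nxt or cand[0] < nxt[m][0]:
                    nxt[m] = cand
        best = nxt
        if not best:
            return None  # no feasible assignment for the layers seen so far
    # Among all feasible memory usages, return the plan with the lowest time.
    return min(best.values(), key=lambda entry: entry[0])


if __name__ == "__main__":
    # Illustrative numbers only: faster strategies are assumed to use more memory.
    example_layer: List[Strategy] = [
        ("data_parallel", 1.0, 4),
        ("tensor_parallel", 1.5, 2),
        ("sharded_data_parallel", 2.0, 1),
    ]
    for budget in (12, 8, 3, 2):
        print(budget, search_plan([example_layer] * 3, budget))
```

Under these assumptions, tightening the memory budget pushes the search away from the fastest strategy toward more memory-efficient ones, which is the kind of trade-off an automatic planner has to resolve layer by layer.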