Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). However, few studies have explored its applicability to heterogeneous clusters, where it could fully exploit all available resources for training large models. This paper presents \OurSystem, an automated system designed to expedite SPMD DNN training on heterogeneous clusters. \OurSystem jointly optimizes the tensor sharding strategy, the sharding ratios across heterogeneous devices, and the communication methods for tensor exchanges to accelerate distributed training with SPMD parallelism. We formulate model partitioning as a program synthesis problem: using a distributed instruction set, we generate from scratch a distributed program that semantically resembles the program designed for a single device, and we systematically explore the solution space with an A*-based search algorithm. We derive the optimal tensor sharding ratios by formulating their selection as a linear programming problem. Additionally, \OurSystem explores tensor communication optimization in a heterogeneous cluster and integrates it into the program synthesis process, automatically choosing optimal collective communication primitives and applying the sufficient factor broadcasting technique. Extensive experiments on representative workloads demonstrate that \OurSystem achieves up to 2.41x speed-up on heterogeneous clusters.
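
As a minimal illustration of how sharding ratios can be cast as a linear program (a sketch under simplifying assumptions, not necessarily the formulation used by \OurSystem), consider $n$ devices, where device $i$ has throughput $s_i$ and is assigned a fraction $r_i$ of a computation with total cost $C$. The symbols $n$, $s_i$, $r_i$, $C$ and $T$ are introduced here only for illustration. Minimizing the completion time $T$ of the slowest device gives
\[
\begin{array}{ll}
\displaystyle \min_{r_1,\dots,r_n,\,T} & T \\[4pt]
\mbox{subject to} & r_i\,C / s_i \le T, \quad i = 1,\dots,n, \\[2pt]
& \displaystyle \sum_{i=1}^{n} r_i = 1, \qquad r_i \ge 0 .
\end{array}
\]
When only computation is modeled, every constraint is tight at the optimum, so $r_i = s_i / \sum_{j} s_j$ and $T = C / \sum_{j} s_j$: each device receives work in proportion to its speed. Adding communication terms that are linear in $r_i$ to the left-hand sides keeps the problem a linear program.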