This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, to expedite training in ML clusters that are heterogeneous in both devices and topology. As a novel design, we feed both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and couple the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to apply the technique automatically so as to minimize training time. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.