Training a deep neural network (DNN) demands substantial computation and memory. It is common to train a DNN on multiple devices to reduce the overall training time. Each layer in a DNN can be parallelized in several ways, and exhaustively searching all combinations of these choices for an optimal parallelization strategy is prohibitively time-consuming. The standard practice is to use data parallelism because of its simplicity. However, data parallelism is often sub-optimal, suffering from poor performance and high memory requirements. Expert-designed strategies have been proposed on a case-by-case basis using domain-specific knowledge. These strategies do not generalize well to DNNs other than the ones for which they were designed, and they are not necessarily the best choice even for those. In this paper, we propose an approach that automatically finds efficient parallelization strategies for DNNs from their computation graphs. We present an efficient algorithm that computes these strategies within a reasonable time in practice. We evaluate the effectiveness of our approach on various DNNs and compare the performance of the strategies it identifies against data parallelism, expert-designed strategies, and state-of-the-art approaches. Our results show that the strategies found by our approach outperform the baseline data parallelism strategy in all cases. In addition, our strategies achieve better performance than the expert-designed strategies and the state-of-the-art approaches.