Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently process different micro-batches in a pipelined fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally independent operators, resulting in reduced memory requirements and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the strategy search time by 9-21X compared to PipeDream and Piper.
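To make the core idea of GPP concrete, the following is a minimal sketch (not GraphPipe's actual planner) of how a DAG of pipeline stages exposes concurrency that a sequential pipeline misses: stages with no dependency path between them can be grouped into the same "level" and run on different devices at the same pipeline step. The stage names and dependency edges below are hypothetical examples of a multi-branch model.

```python
from collections import defaultdict, deque

def concurrent_stage_levels(stages, deps):
    """Group pipeline stages into topological levels; stages in the same
    level have no dependency path between them and may execute
    concurrently -- the opportunity that graph pipeline parallelism
    exploits and that a purely sequential stage ordering serializes."""
    indeg = {s: 0 for s in stages}
    succ = defaultdict(list)
    for u, v in deps:          # edge (u, v): stage v consumes the output of u
        succ[u].append(v)
        indeg[v] += 1
    levels = []
    ready = deque(s for s in stages if indeg[s] == 0)
    while ready:
        level = sorted(ready)  # all currently runnable, mutually independent stages
        ready.clear()
        for s in level:
            for t in succ[s]:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
        levels.append(level)
    return levels

# Hypothetical multi-branch DNN: two parallel branches feed a fusion stage.
stages = ["embed", "branchA", "branchB", "fuse"]
deps = [("embed", "branchA"), ("embed", "branchB"),
        ("branchA", "fuse"), ("branchB", "fuse")]
print(concurrent_stage_levels(stages, deps))
# -> [['embed'], ['branchA', 'branchB'], ['fuse']]
```

A sequential pipeline must place `branchA` and `branchB` one after the other, while the DAG view lets them occupy different devices during the same step, which is the source of the memory and throughput benefits described above.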