We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We design practical algorithms for this NP-hard problem and show that they are nearly optimal in practice by comparing against strong lower bounds obtained via novel mixed-integer programming (MIP) formulations. Applying these algorithms and lower-bound methods to production models, we obtain substantially tighter approximation guarantees than those implied by standard combinatorial lower bounds. For example, in geometric mean over production data with $k=16$ pipeline stages, our MIP formulations more than double the lower bounds, improving the approximation ratio from $2.175$ to $1.058$. This work shows that although max-throughput partitioning is theoretically hard, the algorithmic side of the problem is well under control in practice, and much of the remaining challenge lies in developing more accurate cost models to feed into the partitioning algorithms.
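To make the objective concrete, the following is a minimal sketch of bottleneck-stage minimization on a deliberately simplified instance: a linear chain of operators rather than a general model graph. It is not the algorithm or the MIP formulation from this work. The names `bottleneck_partition` and `naive_lower_bound`, the cost convention (a stage pays its compute plus the transfer cost on each boundary it touches), and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Tuple


def bottleneck_partition(
    work: List[float], comm: List[float], k: int
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split a chain of ops into at most k contiguous stages, minimizing
    the cost of the slowest (bottleneck) stage.

    work[i]: compute time of op i (hypothetical units).
    comm[i]: transfer cost if a stage boundary is placed between op i and op i+1.
    Assumed cost model: stage cost = compute + cost of each boundary it touches.
    """
    n = len(work)
    prefix = [0.0]
    for w in work:
        prefix.append(prefix[-1] + w)

    def stage_cost(i: int, j: int) -> float:
        # Cost of the stage holding ops i..j-1.
        cost = prefix[j] - prefix[i]
        if i > 0:
            cost += comm[i - 1]  # receive from the previous stage
        if j < n:
            cost += comm[j - 1]  # send to the next stage
        return cost

    inf = float("inf")
    # best[s][j]: minimum bottleneck for the first j ops using exactly s stages.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(s - 1, j):
                cand = max(best[s - 1][i], stage_cost(i, j))
                if cand < best[s][j]:
                    best[s][j] = cand
                    cut[s][j] = i

    # Fewer stages can win once communication is charged, so try every s <= k.
    s_best = min(range(1, k + 1), key=lambda s: best[s][n])
    stages = []
    j, s = n, s_best
    while s > 0:
        i = cut[s][j]
        stages.append((i, j))
        j, s = i, s - 1
    stages.reverse()
    return best[s_best][n], stages


def naive_lower_bound(work: List[float], k: int) -> float:
    """A simple bound that ignores communication: the bottleneck is at least
    the largest single op and at least the average load over k stages."""
    return max(max(work), sum(work) / k)


# Toy instance (made-up numbers).
work = [4, 2, 3, 5, 1, 3]
comm = [1, 2, 1, 3, 1]
value, stages = bottleneck_partition(work, comm, k=3)
bound = naive_lower_bound(work, k=3)
print(value, stages, bound, value / bound)
```

On this toy chain the exact bottleneck is 10 while the communication-blind bound is 6, so the bound alone certifies only a ratio of about 1.67 even though the partition is optimal. This illustrates, in miniature, why a stronger lower bound can shrink the certified approximation ratio without changing the partition itself. On general graphs the problem is NP-hard and this chain dynamic program does not apply.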