We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for $k \in \{2, 4, 8, 16, 32, 64\}$, we empirically show that these lower bounds are strong enough to be useful in practice, and substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with $k = 16$ pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the bottleneck time of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855.
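Concretely (a sketch in our own notation, not the paper's formalism): the objective is the bottleneck time of a $k$-way partition of the model graph's nodes $V$, and the stated gap-closure factor follows from the two reported means,
\[
\min_{V = S_1 \cup \cdots \cup S_k} \; \max_{1 \le j \le k} \bigl( T_{\mathrm{comp}}(S_j) + T_{\mathrm{comm}}(S_j) \bigr),
\qquad
\frac{1 - 0.4598}{1 - 0.9452} = \frac{0.5402}{0.0548} \approx 9.86,
\]
where $T_{\mathrm{comp}}(S_j)$ and $T_{\mathrm{comm}}(S_j)$ denote stage $j$'s compute and communication times; the quoted factor of 9.855 presumably reflects the unrounded geometric means.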