The escalating size of Deep Neural Networks (DNNs) has spurred growing research interest in hosting and serving DNN models across multiple devices. A number of studies partition a DNN model across devices and provide device placement solutions. Existing methods, however, either suffer from poor placement performance because of the exponential search space, or miss the optimal placement because limited heuristics shrink the search space. Moreover, these methods ignore the runtime inter-operator optimization of a computation graph when coarsening the graph, which degrades end-to-end inference performance. This paper presents Moirai, which better exploits runtime inter-operator fusion in a model to render a coarsened computation graph, reducing the search space while preserving the inter-operator optimization provided by inference backends. Moirai also generalizes the device placement algorithm from multiple perspectives by considering inference constraints and device heterogeneity. Extensive experimental evaluation with 11 large DNNs demonstrates that Moirai outperforms the state-of-the-art counterparts, i.e., Placeto, m-SCT, and GETF, reducing end-to-end inference latency by up to 4.28$\times$. The Moirai code is anonymously released at \url{https://github.com/moirai-placement/moirai}.