Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful for tasks such as DNN graph-level or tensor-level optimization and device selection. Because the space of DNN models and devices is too large to profile every combination directly, recent efforts focus on building predictors that model the performance of DNN models on different devices. However, no existing cost model accurately predicts the performance of diverse tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative yet efficient representation of tensor programs, called compact ASTs, together with a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaptation-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm so that the predictor can learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines, achieving prediction errors of 14.03% and 10.85% for cross-model and cross-device prediction, respectively, along with an order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
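The abstract does not spell out how the pre-order-based positional encoding is computed, so the following is only a minimal sketch of the general idea, under stated assumptions: each node of a toy AST is given its index in a pre-order traversal, and that index is mapped to a standard Transformer-style sinusoidal vector. The names `Node`, `preorder`, and `sinusoidal`, the toy loop nest, and the choice of sinusoidal encoding are all hypothetical illustrations and are not taken from the CDMPP implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node of a toy AST; `label` and `children` are illustrative names."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder(root: Node) -> List[Tuple[int, str]]:
    """Return (position, label) pairs, visiting each node before its children."""
    order: List[Tuple[int, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append((len(order), node.label))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return order


def sinusoidal(position: int, dim: int = 8) -> List[float]:
    """Sinusoidal encoding of a scalar position (an assumed choice, not CDMPP's)."""
    vec: List[float] = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec[:dim]


# Toy loop nest: for i { for j { C[i, j] = A[i, j] + B[i, j] } }
tree = Node("for_i", [
    Node("for_j", [
        Node("store_C", [Node("load_A"), Node("load_B")]),
    ]),
])

positions = preorder(tree)
encodings = [sinusoidal(pos) for pos, _ in positions]
```

Because the position reflects the order in which nodes are reached from the root, flattening the tree into a sequence in this way keeps parent-before-child ordering available to a sequence model, which is one plausible reading of "capturing the internal structure" of a tensor program.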