Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph-level or tensor-level optimization and device selection. Because the large space of DNN models and devices impedes direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, no existing approach has achieved a cost model that accurately predicts the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaptation-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, so that the predictor can learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines, with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
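To make the idea of a pre-order-based positional encoding concrete, the following is a minimal sketch and not the CDMPP implementation. The abstract does not specify the encoding function, so a Transformer-style sinusoidal form is assumed here; the `Node` class, the helper names `preorder` and `sinusoidal`, and the toy tree are all hypothetical. Each node of a small AST receives its pre-order visit index as its position, and that index is then mapped to a fixed-length vector.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical AST node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder(root: Node) -> List[Tuple[int, str]]:
    """Return (position, label) pairs in pre-order (parent before children)."""
    out: List[Tuple[int, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        out.append((len(out), node.label))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return out


def sinusoidal(pos: int, dim: int) -> List[float]:
    """Sinusoidal encoding of one position (assumed form, not from the paper)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


# Toy tree standing in for a tensor program: an outer loop with a nested loop.
tree = Node("for_i", [Node("for_j", [Node("load"), Node("mul")]), Node("store")])
order = preorder(tree)
encodings = [sinusoidal(pos, 4) for pos, _ in order]
```

Because the position comes from the traversal order, two nodes with the same label but different places in the tree receive different encodings, which is one way a sequence model could recover structural information from a flattened tree.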