Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware and for managing memory, data transfers, and, where applicable, multi-accelerator execution. In addition, it is common practice to deploy pre-trained models in environments distinct from their native development settings. This has led to the introduction of interchange formats such as ONNX, which comes with its own runtime infrastructure, ONNX Runtime, and serves as a standard format that can be used across diverse DL frameworks and languages. Even though these runtime infrastructures have a great impact on inference performance, no previous paper has investigated their energy efficiency. In this study, we monitor the energy consumption and inference time of the runtime infrastructures of three well-known DL frameworks as well as ONNX, using three different DL models. To add nuance to our investigation, we also examine the impact of using different execution providers. We find that the performance and energy efficiency of DL are difficult to predict. One framework, MXNet, outperforms both PyTorch and TensorFlow on the computer vision models with batch size 1, due to efficient GPU usage and thus low CPU usage. With batch size 64, however, PyTorch and MXNet become practically indistinguishable, while TensorFlow is consistently outperformed. For BERT, PyTorch exhibits the best performance. Converting the models to ONNX usually yields significant performance improvements, but the ONNX-converted ResNet model with batch size 64 consumes approximately 10% more energy and time than the original PyTorch model.