Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware and for managing memory, data transfers, and, where applicable, multi-accelerator execution. Additionally, it is common practice to deploy pre-trained models in environments distinct from their native development settings. This led to the introduction of interchange formats such as ONNX, which comes with its own runtime infrastructure, ONNX Runtime, and serves as a standard format that can be used across diverse DL frameworks and languages. Even though these runtime infrastructures have a great impact on inference performance, no previous paper has investigated their energy efficiency. In this study, we monitor the energy consumption and inference time of the runtime infrastructures of three well-known DL frameworks, as well as ONNX, using three different DL models. To add nuance to our investigation, we also examine the impact of using different execution providers. We find that the performance and energy efficiency of DL runtime infrastructures are difficult to predict. One framework, MXNet, outperforms both PyTorch and TensorFlow for the computer vision models at batch size 1, due to efficient GPU usage and thus low CPU usage. However, at batch size 64, PyTorch and MXNet become practically indistinguishable, while TensorFlow is consistently outperformed. For BERT, PyTorch exhibits the best performance. Converting the models to ONNX yields significant performance improvements in the majority of cases. Finally, in our preliminary investigation of execution providers, we observe that TensorRT always outperforms CUDA.
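The abstract does not say how energy consumption and inference time were measured. As a rough illustration only, the sketch below assumes that power is available as a list of (timestamp, watts) samples and derives energy by trapezoidal integration, alongside a simple wall-clock timer for inference. The names `energy_joules`, `time_inference`, and `EXAMPLE_TRACE` are hypothetical and are not taken from the study.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical power trace: a constant 50 W draw sampled every 0.5 s for 2 s.
EXAMPLE_TRACE = [(0.0, 50.0), (0.5, 50.0), (1.0, 50.0), (1.5, 50.0), (2.0, 50.0)]


def energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (timestamp_s, power_W) samples with the trapezoidal rule."""
    total = 0.0
    # Pair each sample with its successor; fewer than two samples gives 0 J.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def time_inference(run_once: Callable[[], object], repeats: int = 10) -> float:
    """Return the mean wall-clock seconds per call of run_once."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) / repeats


# Assumption for illustration: 100 inferences ran during the sampled window.
joules_per_inference = energy_joules(EXAMPLE_TRACE) / 100
```

In practice, `run_once` would wrap a single forward pass of the framework under test, and the power samples would come from whatever hardware counters or external meter the measurement setup provides. On a GPU, asynchronous execution would also need to be synchronized before the timer is stopped.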