Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LM inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising way to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference with code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the serving configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to the other serving configurations. Similarly, optimized runtime engines such as ONNX paired with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. TORCH paired with CUDA also exhibited efficient computing-resource utilization. Conclusions. The choice of serving configuration significantly impacts energy efficiency. While further research is needed, we recommend the above configurations, matched to software engineers' requirements, for enhancing serving efficiency in terms of energy and performance.