Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LM inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising way to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference with code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the serving configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to the other serving configurations. Similarly, optimized runtime engines such as ONNX paired with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. TORCH paired with CUDA also exhibited efficient computing-resource utilization. Conclusions. The choice of serving configuration significantly impacts energy efficiency. While further research is needed, we recommend the above configurations, matched to software engineers' requirements, for enhancing serving efficiency in terms of energy and performance.