The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that in some contexts the best-performing model in terms of accuracy can also be energy-efficient. While LLMs tend to consume significantly more energy than traditional machine learning models, they show the same or even lower levels of accuracy in our zero-shot classification setting. We observe substantial variability in inference energy consumption ($<$mWh to $>$kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. Our findings demonstrate that energy efficiency and accuracy represent distinct evaluation dimensions that do not necessarily align. We argue that sustainable AI development requires systematic evaluation of both performance and resource efficiency.