The rapid adoption of large language models (LLMs) has led to significant advances in natural language processing and text generation. However, the energy consumed by LLM inference remains a major challenge for sustainable AI deployment. To address this problem, we model the workload-dependent energy consumption and runtime of LLM inference tasks on heterogeneous GPU-CPU systems. By conducting an extensive characterization study of several state-of-the-art LLMs and analyzing their energy and runtime behavior across different magnitudes of input prompts and output text, we develop accurate (R^2 > 0.96) energy and runtime models for each LLM. We employ these models to explore an offline, energy-optimal LLM workload scheduling framework. Through a case study, we demonstrate the advantages of energy- and accuracy-aware scheduling compared to existing best practices.
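As a rough illustration of what a workload-dependent energy model could look like, the sketch below fits a least-squares regression of energy on input and output token counts and reports R^2. The functional form (linear terms plus an input-output interaction term), the function names, and the synthetic measurements are all assumptions made for illustration; the abstract does not state the model form or provide data.

```python
import numpy as np

def design_matrix(n_in, n_out):
    """Features: input tokens, output tokens, their product, and an intercept.
    This functional form is an assumption, not taken from the paper."""
    n_in = np.asarray(n_in, dtype=float)
    n_out = np.asarray(n_out, dtype=float)
    return np.column_stack([n_in, n_out, n_in * n_out, np.ones_like(n_in)])

def fit_energy_model(n_in, n_out, energy):
    """Ordinary least-squares fit; returns the four coefficients."""
    X = design_matrix(n_in, n_out)
    coef, *_ = np.linalg.lstsq(X, np.asarray(energy, dtype=float), rcond=None)
    return coef

def predict(coef, n_in, n_out):
    """Predicted energy for the given token counts."""
    return design_matrix(n_in, n_out) @ coef

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic "measurements" (hypothetical numbers, arbitrary energy units).
rng = np.random.default_rng(0)
n_in = rng.integers(8, 2048, size=200)
n_out = rng.integers(8, 2048, size=200)
true_coef = np.array([0.02, 0.5, 1e-4, 5.0])
energy = design_matrix(n_in, n_out) @ true_coef + rng.normal(0.0, 5.0, size=200)

coef = fit_energy_model(n_in, n_out, energy)
r2 = r_squared(energy, predict(coef, n_in, n_out))
print("coefficients:", coef)
print("R^2:", r2)
```

Under these assumptions, one fitted model per LLM could feed an offline scheduler that compares predicted energy across candidate models for each query; a runtime model could be fitted the same way by swapping the target variable.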