While the high energy consumption of Large Language Models (LLMs) is recognized by the community, system operators lack guidance on energy-efficient LLM inference deployments that exploit the energy trade-offs of heterogeneous hardware, because energy-aware benchmarks and data are scarce. In this work, we address this gap with Watt Counts: the largest open-access dataset of LLM energy consumption, with over 5,000 experiments covering 50 LLMs on 10 NVIDIA Graphics Processing Units (GPUs) in batch and server scenarios, along with a reproducible, open-source benchmark that enables community submissions to expand the dataset. Leveraging this dataset, we conduct a system-level study of LLM inference across heterogeneous GPU architectures and show that GPU selection is crucial for energy efficiency and that the optimal hardware choice varies significantly across models and deployment scenarios, which demonstrates the importance of hardware-aware deployment in heterogeneous LLM systems. Guided by our data and insights, we show that practitioners can reduce energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios.