With the ubiquitous use of modern large language models (LLMs) across industries, inference serving for these models is continually expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve them. Energy availability has emerged as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs that arise when energy efficiency is made the primary goal of LLM serving under performance SLOs. We show that, depending on the inputs, the model, and the service-level agreements, LLM inference providers have several knobs available for improving energy efficiency. We characterize the impact of these knobs on latency, throughput, and energy. By exploring these trade-offs, we offer insights into optimizing energy usage without compromising performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments.