From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

Large language models (LLMs) have exploded in popularity due to generative capabilities that go far beyond the prior state of the art. These technologies are increasingly being leveraged in domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs receive less attention than the energy costs of training LLMs, despite how often these large models are called on to conduct inference in practice (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment, a better understanding of their resource utilization is crucial for cost savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA, a recent state-of-the-art LLM developed by Meta AI, on two generations of popular GPUs (NVIDIA V100 and A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks and benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.