Large Language Model (LLM) inference is central to modern AI applications and dominates worldwide datacenter workloads, making it critical to predict its energy footprint. Existing approaches estimate energy consumption as a simple linear function of input and output sequence lengths. However, by analyzing the autoregressive structure of Transformers, which implies a fundamentally non-linear relationship between sequence lengths and energy consumption, we demonstrate the existence of a generation energy minimum. Peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs. Consequently, we propose SweetSpot, an analytical model derived from the computational and memory-access complexity of the Transformer architecture that accurately characterizes the efficiency curve as a function of input and output lengths. To assess accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite. We test input and output lengths from 64 to 4096 tokens and achieve a mean absolute percentage error (MAPE) of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" reduces energy usage by up to 33.41x, enabling informed truncation, summarization, and adaptive generation strategies in production systems.