Large Language Model (LLM) inference is central to modern AI applications, making it critical to understand its energy footprint. Existing approaches typically estimate energy consumption as a simple linear function of input and output sequence lengths, yet our observations reveal clear energy-efficiency regimes: peak efficiency occurs at short-to-moderate input lengths combined with medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model, derived from the computational and memory-access complexity of the Transformer architecture, that accurately characterizes the efficiency curve as a function of input and output lengths. To assess its accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean absolute percentage error (MAPE) of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
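The non-linear dependency described above can be sketched from the Transformer complexity terms the abstract names: prefill attention cost scaling quadratically in input length, decode steps reading a KV cache that grows with the context, plus per-token weight streaming and fixed overhead. The sketch below is illustrative only; the coefficients are hypothetical placeholders, not fitted values from the paper, and the exact functional form of the proposed model may differ.

```python
# Illustrative sketch of a complexity-derived energy model, NOT the paper's
# fitted model. All coefficients below are hypothetical placeholders chosen
# only to reproduce the qualitative "sweet spot" behavior described above.
A_PREFILL = 1e-7   # prefill attention: ~ n_in^2
B_KV      = 1e-6   # decode KV-cache reads: grows with context length
C_DECODE  = 1e-3   # per-token weight streaming during decode
D_FIXED   = 0.5    # fixed per-request overhead (J)

def energy_joules(n_in: int, n_out: int) -> float:
    """Estimated energy (J) for one request under the complexity terms above."""
    prefill = A_PREFILL * n_in ** 2
    # Approximate sum of KV-cache sizes read across n_out decode steps:
    # sum_{t=1..n_out} (n_in + t) ~= n_out * n_in + n_out**2 / 2
    decode_kv = B_KV * (n_out * n_in + n_out ** 2 / 2)
    decode_weights = C_DECODE * n_out
    return prefill + decode_kv + decode_weights + D_FIXED

def efficiency(n_in: int, n_out: int) -> float:
    """Generated tokens per joule; higher is better."""
    return n_out / energy_joules(n_in, n_out)

# Grid search for the most efficient (input, output) length pair over a
# coarse subset of the 64-4096 token range tested in the paper.
lengths = [64, 256, 1024, 4096]
best = max(((i, o) for i in lengths for o in lengths),
           key=lambda p: efficiency(*p))
```

Under these placeholder coefficients, the grid search lands on a short input with a medium-length output, while very short outputs (fixed overhead amortized over few tokens) and very long inputs (quadratic prefill plus larger KV reads) both score poorly, mirroring the regimes the abstract describes.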