Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems, where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles, the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, and normalizes it by the number of successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline that maps RAPL signals to workflow-level energy, and a reproducibility protocol that binds every measurement to its hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), which isolates the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x the mean energy per successful goal of linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, which shows that the metric captures orchestration structure rather than carrying a fixed upward bias. These findings establish that energy per inference is an insufficient unit for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking of workloads in which orchestration structure is the primary determinant of energy cost.
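The two metrics described above can be illustrated with a minimal sketch. It assumes that EpG is the summed energy of all attempts (failed ones included) divided by the count of successful goals, as the abstract states, and that OOI is the ratio of agentic EpG to linear EpG, which is an assumption consistent with the reported "below 1.0x" inversion but not a definition confirmed by the text. The `Attempt` type and both function names are hypothetical and are not part of A-LEMS.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Attempt:
    """One execution attempt toward a goal (hypothetical record type)."""
    energy_joules: float  # energy attributed to this attempt
    succeeded: bool       # whether the attempt completed its goal


def energy_per_goal(attempts: Iterable[Attempt]) -> float:
    """EpG: total energy over all attempts divided by successful goals."""
    attempts = list(attempts)
    total = sum(a.energy_joules for a in attempts)
    successes = sum(1 for a in attempts if a.succeeded)
    if successes == 0:
        raise ValueError("EpG is undefined when no goal succeeded")
    return total / successes


def orchestration_overhead_index(epg_agentic: float, epg_linear: float) -> float:
    """OOI, assumed here to be the ratio of agentic EpG to linear EpG."""
    if epg_linear <= 0:
        raise ValueError("linear EpG must be positive")
    return epg_agentic / epg_linear


# Illustrative values only: one failed attempt and two successful ones.
runs = [Attempt(100.0, False), Attempt(150.0, True), Attempt(50.0, True)]
epg = energy_per_goal(runs)  # (100 + 150 + 50) J / 2 goals

# Ratio of the two mean EpG figures reported in the abstract.
ooi = orchestration_overhead_index(888.1, 205.3)
```

In this sketch the failed attempt still contributes its 100 J to the numerator, which is the point of normalizing by successful goals rather than by invocations. Dividing the two means reported in the abstract gives roughly 4.33, matching the stated figure, though the source may aggregate per-task ratios differently.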