Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems, where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles, the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, and normalizes it by the number of successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline that maps RAPL signals to workflow-level energy, and a reproducibility protocol that binds every measurement to its hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), which isolates the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x the mean energy per successful goal of linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming that the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking of workloads in which orchestration structure is the primary determinant of energy cost.
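The two metrics can be illustrated with a minimal Python sketch. It is not the A-LEMS implementation: the `Attempt` record, the function names, and the sample attempt data are hypothetical, and treating OOI as the plain ratio of agentic EpG to linear EpG is an assumption that is consistent with the reported 4.33x figure but is not spelled out in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One execution attempt for a goal (hypothetical record layout)."""
    goal_id: str
    energy_joules: float  # energy attributed to this attempt
    succeeded: bool


def energy_per_successful_goal(attempts):
    """Total energy over ALL attempts divided by the number of goals completed.

    Failed attempts and retries still contribute energy to the numerator;
    only goals with at least one successful attempt count in the denominator.
    """
    completed_goals = {a.goal_id for a in attempts if a.succeeded}
    if not completed_goals:
        raise ValueError("EpG is undefined when no goal succeeded")
    total_energy = sum(a.energy_joules for a in attempts)
    return total_energy / len(completed_goals)


def orchestration_overhead_index(epg_agentic, epg_linear):
    """Assumed form of OOI: ratio of agentic EpG to linear EpG."""
    if epg_linear <= 0:
        raise ValueError("linear baseline EpG must be positive")
    return epg_agentic / epg_linear


# Illustrative data only: g1 needs a retry, g2 succeeds at once, g3 never succeeds.
attempts = [
    Attempt("g1", 100.0, False),
    Attempt("g1", 150.0, True),
    Attempt("g2", 200.0, True),
    Attempt("g3", 50.0, False),
]
epg = energy_per_successful_goal(attempts)  # 500 J over 2 completed goals

# Mean EpG values reported in the abstract (888.1 J agentic, 205.3 J linear).
ooi = orchestration_overhead_index(888.1, 205.3)
print(f"EpG = {epg:.1f} J, OOI = {ooi:.2f}x")
```

In this sketch the energy spent on the failed goal `g3` is still charged to the goals that did succeed, which is the property that separates EpG from per-inference normalization. Under the assumed ratio form, an OOI below 1.0 means the agentic workflow spent less energy per successful goal than the linear baseline.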
