As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates are based on non-production settings, leading to systematic overestimation. We introduce a bottom-up framework that estimates inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating that widely cited estimates are overstated by 4-20x. In test-time scaling scenarios where queries are 15x longer than typical, the median energy rises 13x to 3.91 Wh/query (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate line-of-sight energy reductions of 8-20x. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh/day; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.
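The abstract names three inputs (token throughput, node power, and overhead) but does not give the formula itself. The following is a minimal sketch of how such a bottom-up estimate could be computed, not the authors' implementation. The function names, the multiplicative treatment of overhead, and the example inputs are all assumptions chosen for illustration; only the fleet-level figures in the usage note come from the abstract.

```python
def energy_per_query_wh(tokens_per_query, node_tokens_per_s, node_power_w, overhead=1.0):
    """Hypothetical bottom-up estimate of energy per query, in Wh.

    Assumes a node sustains `node_tokens_per_s` tokens/s across all
    concurrent requests while drawing `node_power_w` watts, and that
    facility overhead scales energy by the factor `overhead`.
    """
    if node_tokens_per_s <= 0:
        raise ValueError("node_tokens_per_s must be positive")
    node_seconds = tokens_per_query / node_tokens_per_s  # node time attributable to this query
    joules = node_seconds * node_power_w * overhead       # W * s = J
    return joules / 3600.0                                 # 1 Wh = 3600 J


def fleet_energy_gwh_per_day(queries_per_day, avg_wh_by_class, share_by_class):
    """Daily fleet energy in GWh for a mix of query classes.

    `avg_wh_by_class` holds the average (not median) Wh/query per class;
    `share_by_class` holds the fraction of traffic in each class.
    """
    if abs(sum(share_by_class.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    total_wh = sum(
        queries_per_day * share_by_class[name] * avg_wh_by_class[name]
        for name in share_by_class
    )
    return total_wh / 1e9  # 1 GWh = 1e9 Wh


# Illustrative placeholder inputs, NOT values from the paper.
example_wh = energy_per_query_wh(
    tokens_per_query=500, node_tokens_per_s=2000, node_power_w=8000, overhead=1.2
)

# Fleet arithmetic using the abstract's totals: 0.7 GWh/day over 1e9 queries
# corresponds to an average of 0.7 Wh/query across the fleet.
baseline_gwh = fleet_energy_gwh_per_day(1e9, {"typical": 0.7}, {"typical": 1.0})
```

As a reading of the abstract's own numbers, 0.7 GWh/day over 1 billion queries implies a fleet average of about 0.7 Wh/query, which is higher than the 0.31 Wh median. That gap would be consistent with a right-skewed energy distribution, though the abstract does not state how the fleet totals were derived.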