How many tokens can a GPU inference cluster deliver per watt? Across deployments of identical hardware, the answer varies by 40x -- not because of software inefficiency, but because of the serving context window. We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology -- which determines the effective context window each GPU services -- is therefore a more powerful energy lever than buying newer hardware. Working from published H100 power measurements, a calibrated logistic power model, and a roofline throughput model, we derive these results analytically using the inference-fleet-sim framework; no new hardware experiments were conducted. Two-pool context-length routing (FleetOpt) delivers roughly 2.5x better tok/W than a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The two gains are independent and multiply: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. B200/H200 numbers are analytical projections (+-20% uncertainty); H100 results are calibrated to published measurements. For MoE models, active-parameter weight streaming adds a third lever. Qwen3-235B-A22B (22B active) reaches roughly 37.8 tok/W at 8K context on H100 -- 5.1x better than Llama-3.1-70B -- because decode time scales with the activated weights, not the total parameter count. MoE dispatch overhead is excluded, so this figure is an upper bound.
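The 1/W relationship follows directly from the KV-cache concurrency argument: the number of sequences a GPU can hold in flight is the KV-cache memory budget divided by the per-sequence cache footprint, which grows linearly with context length, while power draw stays roughly flat. A minimal sketch of that arithmetic, with all constants chosen as hypothetical placeholders (they are not the paper's calibrated model parameters):

```python
# Illustrative sketch of the 1/W law: doubling the context window halves
# tok/W. All constants below are hypothetical placeholders for exposition,
# not calibrated values from the inference-fleet-sim framework.

KV_BUDGET_BYTES = 40e9          # hypothetical HBM left for KV cache after weights
KV_BYTES_PER_TOKEN = 40_000     # hypothetical per-token KV footprint
GPU_POWER_W = 700               # H100 SXM board power, assumed near-constant
DECODE_TOK_PER_SEQ_PER_S = 50   # hypothetical per-sequence decode rate

def tok_per_watt(context_len: int) -> float:
    # Concurrency is capped by how many full-context KV caches fit in memory.
    max_seqs = KV_BUDGET_BYTES // (context_len * KV_BYTES_PER_TOKEN)
    # Aggregate decode throughput scales with the number of in-flight
    # sequences (the memory-bound regime of the roofline model).
    throughput = max_seqs * DECODE_TOK_PER_SEQ_PER_S
    # Power draw is treated as constant, so tok/W inherits the 1/W shape.
    return throughput / GPU_POWER_W

for ctx in (4_096, 8_192, 16_384, 32_768, 65_536):
    print(f"{ctx:>6} tokens context -> {tok_per_watt(ctx):6.2f} tok/W")
```

Each doubling of `context_len` halves `max_seqs` and hence `tok_per_watt`, reproducing the qualitative 1/W shape; the paper's actual tok/W figures come from the calibrated power and roofline models, not these placeholder constants.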