The AI inference industry keeps models loaded in GPU memory around the clock to avoid cold-start latency, implicitly treating idle power as a fixed cost of readiness. Yet the structure of this cost has never been empirically decomposed, let alone compared across GPU architectures. We present the first cross-architecture measurement of idle GPU power as a function of VRAM allocation, combining 18 days of production telemetry (335,267 samples, 14 H100 GPUs) with controlled dose-response experiments on three GPU architectures spanning three memory technologies: NVIDIA H100 (HBM3, 80 GB), A100 (HBM2e, 80 GB), and L40S (GDDR6, 48 GB). We observe that idle power is piecewise constant on all three architectures: creating a CUDA context forces a discrete DVFS transition that adds 26 to 66 W over bare idle (26 to 50 W on the HBM architectures, 66 W on GDDR6), while the marginal effect of allocated VRAM is bounded at a practically negligible level ($|\beta| < 0.02$ W/GB) on every device tested. The CUDA context accounts for >98% of the parking tax, the extra idle power paid to keep a model resident, regardless of memory technology. We validate this finding with a real HuggingFace model (Qwen2.5-7B) on all three architectures, confirming a difference of <0.5 W from empty tensors on every device, and we capture cold-start power profiles during model loading. We derive a cold-start breakeven model showing that energy-optimal behavior depends on request arrival rate and loading latency, not model size, with breakeven intervals of 1 to 5 minutes. Our results identify a constraint consistent across all tested architectures: idle-with-context power is determined by DVFS state, not memory occupancy.
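The breakeven idea can be illustrated with a minimal sketch. It is one plausible reading of the trade-off, not necessarily the formulation derived in the paper. It assumes that parking costs a constant power above bare idle, and that a cold start costs a fixed energy equal to the average loading power above bare idle multiplied by the loading time. The function names and the example numbers (50 W parking power, 200 W loading power, 30 s load time) are hypothetical and are not measurements from this work.

```python
def breakeven_interval_s(parking_power_w: float,
                         load_power_w: float,
                         load_time_s: float) -> float:
    """Idle gap (seconds) at which parking and reloading cost the same energy.

    All powers are measured above bare idle (no CUDA context).
    Assumed model: parking for T seconds costs parking_power_w * T joules,
    while one cold start costs load_power_w * load_time_s joules.
    """
    if parking_power_w <= 0:
        raise ValueError("parking_power_w must be positive")
    cold_start_energy_j = load_power_w * load_time_s
    return cold_start_energy_j / parking_power_w


def should_unload(mean_interarrival_s: float,
                  parking_power_w: float,
                  load_power_w: float,
                  load_time_s: float) -> bool:
    """True if the expected gap between requests exceeds the breakeven interval."""
    threshold_s = breakeven_interval_s(parking_power_w, load_power_w, load_time_s)
    return mean_interarrival_s > threshold_s


if __name__ == "__main__":
    # Hypothetical inputs: 50 W parking tax, 200 W extra draw during a 30 s load.
    t_star = breakeven_interval_s(50.0, 200.0, 30.0)
    print(f"breakeven interval: {t_star:.0f} s")
    print("unload at 5 min gaps:", should_unload(300.0, 50.0, 200.0, 30.0))
```

In this sketch model size does not appear as a parameter: because the parking power is set by the DVFS state rather than by allocated VRAM, size enters only indirectly through the loading time. The sketch accounts for energy only and ignores the latency cost of a cold start.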