The Model Parking Tax: Quantifying the Hidden Energy Cost of Always-On GPU Model Deployment

The AI inference industry keeps models loaded in GPU memory around the clock to avoid cold-start latency, implicitly treating idle power as a fixed cost of readiness. Yet the structure of this cost has never been empirically decomposed, let alone compared across GPU architectures. We present the first cross-architecture measurement of idle GPU power as a function of VRAM allocation, combining 18 days of production telemetry (335,267 samples, 14 H100 GPUs) with controlled dose-response experiments on three GPU architectures spanning three memory technologies: NVIDIA H100 (HBM3, 80 GB), A100 (HBM2e, 80 GB), and L40S (GDDR6, 48 GB). We observe that idle power is piecewise constant on all three architectures: creating a CUDA context forces a discrete DVFS transition that adds 26 to 66 W over bare idle (26 to 50 W on the HBM architectures, 66 W on GDDR6), while the marginal effect of VRAM allocation is negligible ($|\beta| < 0.02$ W/GB) on every device tested. The CUDA context accounts for more than 98% of the parking tax regardless of memory technology. We validate this finding with a real HuggingFace model (Qwen2.5-7B) on all three architectures, confirming a difference of less than 0.5 W from empty tensors on every device, and we capture cold-start power profiles during model loading. We derive a cold-start breakeven model showing that energy-optimal behavior depends on request arrival rate and loading latency, not on model size, with breakeven intervals of 1 to 5 minutes. Our results identify a constraint that is consistent across all tested architectures: idle-with-context power is determined by DVFS state, not by memory occupancy.
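The breakeven reasoning can be illustrated with a minimal sketch. It is not the paper's model: it assumes a deliberately simple formulation in which staying parked costs the parking tax (power above bare idle) for the whole idle gap, reloading costs a fixed amount of energy above bare idle, and the expected gap is the reciprocal of the request arrival rate. The function names and the load power and load duration in the usage note are hypothetical and illustrative, not measured values; only the 26 to 66 W parking-tax range comes from the text above.

```python
def breakeven_idle_seconds(parking_tax_w: float,
                           load_power_w: float,
                           load_seconds: float) -> float:
    """Idle gap (s) beyond which unload-and-reload uses less energy than staying parked.

    Assumed simplified model (not taken from the paper):
      parked energy over a gap T  = parking_tax_w * T
      reload energy               = load_power_w * load_seconds
    Both powers are measured above bare idle.
    """
    if parking_tax_w <= 0:
        raise ValueError("parking_tax_w must be positive")
    if load_power_w < 0 or load_seconds < 0:
        raise ValueError("load power and load time must be non-negative")
    reload_energy_j = load_power_w * load_seconds
    return reload_energy_j / parking_tax_w


def should_unload(arrival_rate_per_s: float,
                  parking_tax_w: float,
                  load_power_w: float,
                  load_seconds: float) -> bool:
    """Return True when the expected gap between requests exceeds the breakeven gap.

    Assumption: the expected gap is 1 / arrival_rate_per_s.
    """
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival_rate_per_s must be positive")
    expected_gap_s = 1.0 / arrival_rate_per_s
    threshold_s = breakeven_idle_seconds(parking_tax_w, load_power_w, load_seconds)
    return expected_gap_s > threshold_s
```

For example, with a 50 W parking tax and a hypothetical reload that draws 300 W above bare idle for 30 s, `breakeven_idle_seconds(50.0, 300.0, 30.0)` gives 180 s. Under these assumed inputs, a deployment that receives one request every ten minutes would save energy by unloading, while one that receives a request every minute would not. Model size does not appear in the parking-tax term, which matches the observation that the marginal VRAM effect is negligible.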
