Online GPU Energy Optimization with Switching-Aware Bandits

Energy consumption has become a bottleneck for future computing architectures, from wearable devices to leadership-class supercomputers. Existing energy management techniques largely target CPUs, even though GPUs now dominate power draw in heterogeneous high performance computing (HPC) systems. Moreover, many prior methods rely on purely offline training or on a hybrid of offline and online training, which is impractical and wastes energy during data collection. In this paper, we introduce a practical online GPU energy optimization problem in HPC scenarios. The problem is challenging because (1) GPU frequency scaling exhibits performance-energy trade-offs, (2) online control must balance exploration and exploitation, and (3) frequent frequency switching incurs non-trivial overhead and degrades quality of service (QoS). To address these challenges, we formulate online GPU energy optimization as a multi-armed bandit problem and propose EnergyUCB, a lightweight UCB-based controller that dynamically adjusts GPU core frequency in real time to save energy. Specifically, EnergyUCB (1) defines a reward that jointly captures energy and performance using a core-to-uncore utilization ratio as a proxy for GPU throughput, (2) employs optimistic initialization and UCB-style confidence bonuses to accelerate learning from scratch, and (3) incorporates a switching-aware UCB index and a QoS-constrained variant that enforce explicit slowdown budgets while discouraging unnecessary frequency oscillations. Extensive experiments on real-world workloads from Aurora, the world's third-fastest supercomputer, show that EnergyUCB achieves substantial energy savings with modest slowdown and that the QoS-constrained variant reliably respects user-specified performance budgets.
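To make the idea of a switching-aware UCB index with optimistic initialization concrete, the following is a minimal sketch and not the EnergyUCB implementation itself. It assumes a standard UCB1-style bonus and a fixed additive penalty for leaving the current arm. The class name `SwitchingAwareUCB`, the parameters `explore_coef`, `switch_penalty`, and `optimistic_value`, and the synthetic per-frequency rewards in `simulate` are all hypothetical; the abstract does not give the exact index or reward formula, and the QoS-constrained variant is not modeled here.

```python
import math
import random


class SwitchingAwareUCB:
    """UCB1-style bandit with a fixed penalty for leaving the current arm.

    Illustrative only: the index form and all parameter names are assumptions.
    """

    def __init__(self, n_arms, explore_coef=1.0, switch_penalty=0.05,
                 optimistic_value=1.0):
        self.n_arms = n_arms
        self.explore_coef = explore_coef
        self.switch_penalty = switch_penalty
        # Optimistic initialization: every arm starts with one
        # pseudo-observation at a high reward, so each gets tried early.
        self.counts = [1] * n_arms
        self.means = [optimistic_value] * n_arms
        self.t = n_arms          # total (pseudo + real) observations
        self.current = None      # arm selected on the previous step

    def index(self, arm):
        # Empirical mean plus confidence bonus, minus a switching penalty
        # for any arm other than the one currently in use.
        bonus = self.explore_coef * math.sqrt(
            math.log(self.t + 1) / self.counts[arm])
        switching = self.current is not None and arm != self.current
        return self.means[arm] + bonus - (self.switch_penalty if switching else 0.0)

    def select(self):
        arm = max(range(self.n_arms), key=self.index)
        self.current = arm
        return arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.t += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(steps=2000, seed=0):
    """Run the bandit on made-up per-frequency rewards in [0, 1]."""
    rng = random.Random(seed)
    true_means = [0.2, 0.4, 0.8, 0.5, 0.3]  # hypothetical, one per frequency level
    bandit = SwitchingAwareUCB(len(true_means), explore_coef=0.5,
                               switch_penalty=0.05)
    switches, previous = 0, None
    for _ in range(steps):
        arm = bandit.select()
        if previous is not None and arm != previous:
            switches += 1
        previous = arm
        reward = true_means[arm] + rng.uniform(-0.05, 0.05)
        bandit.update(arm, reward)
    return bandit, switches


if __name__ == "__main__":
    bandit, switches = simulate()
    print("pulls per arm:", bandit.counts)
    print("frequency switches:", switches)
```

In this sketch the penalty enters only the selection index, not the observed reward, so it discourages oscillation between near-equal frequencies without biasing the learned reward estimates. A real controller would replace the synthetic reward with measured energy and utilization counters.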