A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes

As exascale computing becomes a reality, the energy needs of compute nodes in cloud data centers will continue to grow. A common approach to reducing this energy demand is to limit the power consumption of hardware components while workloads are bottlenecked elsewhere in the system. However, designing a resource controller that can detect such bottlenecks and limit power consumption on the fly is complex, and capping power can adversely affect application performance. In this paper, we explore the use of Reinforcement Learning (RL) to design a power capping policy for cloud compute nodes, using observations of current power consumption and instantaneous application performance (heartbeats). By leveraging the Argo Node Resource Management (NRM) software stack in conjunction with the Intel Running Average Power Limit (RAPL) hardware control mechanism, we design an agent that controls the maximum power supplied to processors without compromising application performance. We train a Proximal Policy Optimization (PPO) agent to learn an optimal policy on a mathematical model of the compute nodes, then use the STREAM benchmark to demonstrate and evaluate how the trained agent, running on actual hardware, takes actions that balance power consumption and application performance.
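To make the trade-off concrete, the following is a minimal Python sketch of the kind of objective such an agent might optimize. It is not the paper's model or reward: the saturating progress curve, every constant, and the function names `progress`, `reward`, and `best_cap` are hypothetical, and a simple grid search stands in for the PPO training described above.

```python
import math

# All constants are illustrative assumptions, not values from the paper.
PCAP_MIN, PCAP_MAX = 40.0, 120.0   # hypothetical power-cap range (watts)
MAX_PROGRESS = 25.0                # hypothetical heartbeat rate ceiling (Hz)
SATURATION = 0.05                  # hypothetical curvature of the progress curve


def progress(pcap: float) -> float:
    """Toy heartbeat rate: grows with the power cap, then saturates,
    as one might expect when the workload is bottlenecked elsewhere."""
    return MAX_PROGRESS * (1.0 - math.exp(-SATURATION * (pcap - PCAP_MIN + 10.0)))


def reward(pcap: float, weight: float = 0.5) -> float:
    """Weighted trade-off: normalized performance minus normalized power."""
    perf = progress(pcap) / progress(PCAP_MAX)
    power = pcap / PCAP_MAX
    return weight * perf - (1.0 - weight) * power


def best_cap(weight: float = 0.5, step: float = 1.0) -> float:
    """Grid search over power caps (a stand-in for a learned policy, not PPO)."""
    n = int((PCAP_MAX - PCAP_MIN) / step) + 1
    caps = [PCAP_MIN + i * step for i in range(n)]
    return max(caps, key=lambda p: reward(p, weight))


if __name__ == "__main__":
    for w in (0.0, 0.5, 1.0):
        print(f"weight={w:.1f} -> cap={best_cap(w):.0f} W")
```

With `weight=1.0` only performance counts and the search picks the maximum cap; with `weight=0.0` only power counts and it picks the minimum. Intermediate weights land in between, where extra watts no longer buy a proportional gain in heartbeats.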

