As exascale computing becomes a reality, the energy needs of compute nodes in cloud data centers will continue to grow. A common approach to reducing this energy demand is to limit the power consumption of hardware components when workloads are bottlenecked elsewhere in the system. However, designing a resource controller that detects such bottlenecks and limits power consumption on the fly is complex, and such a controller can adversely affect application performance. In this paper, we explore the use of Reinforcement Learning (RL) to design a power capping policy for cloud compute nodes using observations of current power consumption and instantaneous application performance (heartbeats). By leveraging the Argo Node Resource Management (NRM) software stack in conjunction with the Intel Running Average Power Limit (RAPL) hardware control mechanism, we design an agent that controls the maximum power supplied to processors without compromising application performance. We train a Proximal Policy Optimization (PPO) agent to learn an optimal policy on a mathematical model of the compute nodes, and we use the STREAM benchmark to demonstrate and evaluate how the trained agent, running on actual hardware, chooses actions that balance power consumption and application performance.
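To make the trade-off between power consumption and application performance concrete, the following is a minimal sketch of a toy node model and a reward that balances the two. It is not the mathematical model, the reward, or the PPO training used in this work, and it does not call NRM or RAPL. The class `ToyNode`, the functions `reward` and `average_reward`, and every constant (power limits, saturation power, heartbeat rate, reward weight) are hypothetical and chosen only for illustration. The sketch assumes a memory-bound workload whose heartbeat rate stops improving above a certain power level, and it compares fixed power caps rather than a learned policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ToyNode:
    """Hypothetical first-order node model; all constants are illustrative."""
    p_min: float = 40.0      # lowest allowed power cap in watts (assumed)
    p_max: float = 120.0     # highest allowed power cap in watts (assumed)
    p_sat: float = 80.0      # power above which progress no longer improves (assumed)
    peak_rate: float = 10.0  # heartbeats per second at or above p_sat (assumed)
    power: float = 120.0     # current measured power in watts

    def step(self, cap: float) -> Tuple[float, float]:
        """Apply a power cap and return the observation (power, heartbeat rate)."""
        cap = min(max(cap, self.p_min), self.p_max)
        # Measured power moves halfway toward the cap each step (simple lag).
        self.power += 0.5 * (cap - self.power)
        # Progress scales with power until the workload becomes memory-bound.
        rate = self.peak_rate * min(self.power, self.p_sat) / self.p_sat
        return self.power, rate


def reward(power: float, rate: float, node: ToyNode, weight: float = 0.5) -> float:
    """Normalized progress minus weighted normalized power (weight is assumed)."""
    return rate / node.peak_rate - weight * power / node.p_max


def average_reward(cap: float, steps: int = 50) -> float:
    """Average reward obtained by holding a fixed power cap for `steps` steps."""
    node = ToyNode()
    total = 0.0
    for _ in range(steps):
        power, rate = node.step(cap)
        total += reward(power, rate, node)
    return total / steps


if __name__ == "__main__":
    for cap in (40.0, 80.0, 120.0):
        print(f"cap={cap:5.1f} W  average reward={average_reward(cap):.3f}")
```

In this toy setting, capping at the saturation power scores higher than either running uncapped or capping at the minimum, which is the kind of operating point an RL agent would be expected to discover from power and heartbeat observations alone.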