Power management in multi-server data centers, especially at scale, is a vital issue of increasing importance in the cloud computing paradigm. Existing studies mostly rely on thresholds on the number of idle servers to decide when to switch servers on or off, and they suffer from scalability issues. As a natural approach in view of the Markovian assumption, we present a multi-level continuous-time Markov decision process (CTMDP) model based on state aggregation of multi-server data centers with setup times. The model overcomes the inherent intractability of traditional MDP approaches, which stems from their colossal state-action space. While remaining faithful to the Markovian behavior, it approximates the transition probabilities in a way that keeps the accuracy of the results at a desirable level. Moreover, tuning the number of levels in the multi-level approach allows near-optimal performance to be attained at the expense of increased state-space dimensionality. Simulation results confirm that, in many scenarios of interest, the proposed approach attains noticeable improvements, namely a nearly 50% reduction in the size of the CTMDP while yielding better rewards than existing fixed threshold-based policies and aggregation methods.