Improving the energy efficiency of high-performance computing (HPC) systems is currently one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, opportunities to reduce energy consumption in that context should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid having every process re-execute when a failure occurs. In this context, actions can be taken to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we enrich our simulator and experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter \textit{cascade analysis}, because it includes processes that become blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case but significant in some scenarios; the maximum saving achieved was 90\% over a time interval of 16 minutes. These results show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.