Scaling to larger systems, with current levels of reliability, requires cost-effective methods to mitigate hardware failures. One of the main causes of hardware failure is an uncorrected error in memory, which terminates the current job and wastes all computation since the last checkpoint. This paper presents the first adaptive method for triggering uncorrected error mitigation. It uses a prediction approach that considers the likelihood of an uncorrected error and its current potential cost. The method is based on reinforcement learning, and the only user-defined parameters are the mitigation cost and whether the job can be restarted from a mitigation point. We evaluate our method using classical machine learning metrics together with a cost-benefit analysis, which compares the cost of mitigation actions with the benefits from mitigating some of the errors. On two years of production logs from the MareNostrum supercomputer, our method reduces lost compute time by 54% compared with no mitigation and is just 6% below the optimal Oracle method. The source code is publicly available as open source.
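The cost-benefit comparison described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the event format, and all numbers are hypothetical, and it assumes that a mitigation action fully avoids the loss from the next uncorrected error, which is a simplification.

```python
def lost_compute(events, decisions, mitigation_cost):
    """Total cost (e.g. node-hours) of a mitigation policy on a log.

    events:    list of (ue_follows, potential_loss) pairs (hypothetical format);
               potential_loss is the compute lost if an uncorrected error
               hits at this point and nothing was done.
    decisions: list of booleans, True where the policy triggers mitigation.
    Assumption: mitigating always costs mitigation_cost and fully avoids
    the loss of an uncorrected error that follows.
    """
    total = 0.0
    for (ue_follows, potential_loss), mitigate in zip(events, decisions):
        if mitigate:
            total += mitigation_cost   # pay for the mitigation action
        elif ue_follows:
            total += potential_loss    # unmitigated error: work is lost
    return total


def oracle_decisions(events, mitigation_cost):
    """Oracle with perfect foresight: mitigate only when an error follows
    and the loss it would cause exceeds the mitigation cost."""
    return [ue and loss > mitigation_cost for ue, loss in events]


def saving(events, decisions, mitigation_cost):
    """Fraction of lost compute avoided relative to never mitigating."""
    baseline = lost_compute(events, [False] * len(events), mitigation_cost)
    return 1.0 - lost_compute(events, decisions, mitigation_cost) / baseline


# Made-up example data, not taken from the MareNostrum logs.
events = [(False, 10.0), (True, 100.0), (False, 20.0), (True, 50.0)]
policy = [False, True, True, False]   # one hit, one false alarm, one miss
cost = 2.0

policy_saving = saving(events, policy, cost)
oracle_saving = saving(events, oracle_decisions(events, cost), cost)
```

In this toy accounting a false alarm costs only the mitigation action, while a missed error costs the full potential loss, which is why a policy that weighs both likelihood and current potential cost can approach the Oracle.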