As supercomputers grow in hardware complexity, their susceptibility to faults increases, and measures must be taken to ensure the correctness of results. Some numerical algorithms have properties that allow them to recover from certain types of faults. It has been demonstrated that adaptive Runge-Kutta methods provide resilience against transient faults without adding computational cost. Using recent advances in adaptive step size selection for spectral deferred correction (SDC), an iterative numerical time stepping scheme that can produce methods of arbitrary order, we show that adaptive SDC can also detect and correct transient faults. Its performance is found to be comparable to that of the dedicated resilience strategy Hot Rod.
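The resilience of adaptive time stepping mentioned in the abstract can be illustrated with a minimal sketch. This is neither the adaptive SDC scheme nor Hot Rod; as a stand-in it uses the embedded Heun-Euler pair, and it assumes that a fault perturbs a single stage evaluation once. The names `heun_euler_step`, `integrate`, `k2_fault`, `fault_attempt`, and `fault_size`, as well as the test problem, tolerance, and fault magnitude, are hypothetical choices made for illustration only. The idea shown is that the error estimate already computed for step size selection also flags a corrupted step, so the step is rejected and recomputed.

```python
import math


def heun_euler_step(f, t, u, dt, k2_fault=0.0):
    """One step of the embedded Heun-Euler pair.

    Returns the second-order solution and its difference to the first-order
    solution, which serves as the local error estimate. `k2_fault` is a
    hypothetical perturbation added to the second stage to mimic a fault.
    """
    k1 = f(t, u)
    k2 = f(t + dt, u + dt * k1) + k2_fault
    u_low = u + dt * k1                 # forward Euler, order 1
    u_high = u + 0.5 * dt * (k1 + k2)   # Heun, order 2
    return u_high, abs(u_high - u_low)


def integrate(f, u0, t_end, dt0, tol, adaptive=True,
              fault_attempt=None, fault_size=0.0):
    """Integrate u' = f(t, u) from t = 0 to t_end.

    Returns the final value and the number of rejected steps.
    """
    t, u, dt = 0.0, u0, dt0
    attempt = 0
    rejected = 0
    while t_end - t > 1e-12:
        dt = min(dt, t_end - t)
        # The fault is transient: it hits exactly one step attempt.
        fault = fault_size if attempt == fault_attempt else 0.0
        attempt += 1
        u_new, err = heun_euler_step(f, t, u, dt, fault)
        if adaptive:
            # The estimate scales like dt**2, hence the square root.
            factor = 0.9 * math.sqrt(tol / max(err, 1e-16))
            if err > tol:
                # Estimate too large: discard the step and retry with smaller dt.
                rejected += 1
                dt *= max(0.1, factor)
                continue
            t, u = t + dt, u_new
            dt *= min(2.0, factor)
        else:
            # Fixed step size: every step is accepted, faulty or not.
            t, u = t + dt, u_new
    return u, rejected


def rhs(t, u):
    return -u  # test problem u' = -u with exact solution exp(-t)


exact = math.exp(-1.0)
settings = dict(u0=1.0, t_end=1.0, dt0=0.004, tol=1e-5)
fault = dict(fault_attempt=10, fault_size=100.0)

u_clean, rej_clean = integrate(rhs, **settings)
u_fixed, _ = integrate(rhs, adaptive=False, **settings, **fault)
u_adapt, rej_adapt = integrate(rhs, **settings, **fault)

print("adaptive, no fault: error", abs(u_clean - exact), "rejected", rej_clean)
print("fixed, with fault:  error", abs(u_fixed - exact))
print("adaptive, fault:    error", abs(u_adapt - exact), "rejected", rej_adapt)
```

In this sketch the faulty attempt produces an error estimate far above the tolerance, so the adaptive run rejects it and repeats the step, while the fixed-step run carries the corruption through to the final time. A perturbation small enough to keep the estimate below the tolerance would pass undetected here.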