Checkpoint/Restart (C/R) has been widely deployed in HPC systems, clouds, and industrial data centers, which are typically operated by system engineers. Nevertheless, no existing approach helps system engineers without domain expertise, or domain scientists without knowledge of system fault tolerance, identify the critical variables that C/R must save in order to restore application execution correctly after a failure. To address this problem, we propose an analytical model and a tool, AutoCheck, that automatically identify the critical variables to checkpoint for C/R. AutoCheck relies on two components: first, analytically tracking and optimizing the data dependencies between variables and other application execution state, and second, a set of heuristics that identify the critical variables to checkpoint from the refined data dependency graph (DDG). With AutoCheck, programmers can pinpoint the critical variables to checkpoint within a few minutes. We evaluate AutoCheck on 14 representative HPC benchmarks and show that it identifies the correct critical variables to checkpoint efficiently.
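The abstract does not spell out AutoCheck's heuristics, so the following is only a toy illustration of the general idea of deriving checkpoint candidates from data dependencies, not the tool's actual algorithm. It assumes a hypothetical representation in which each statement of the main loop body is reduced to the variable it writes and the variables it reads (the `Stmt` type, `loop_carried_variables`, and the sample loop body are all invented for this sketch). A variable is flagged when it is read before being written within an iteration and is also written in that iteration, meaning its value is carried from one iteration to the next and cannot simply be recomputed after a restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical statement summary: one written variable and the variables it reads."""
    writes: str
    reads: tuple = ()


def loop_carried_variables(loop_body):
    """Return variables whose value flows from one loop iteration into the next.

    Illustrative heuristic only (an assumption, not AutoCheck's rule set):
    a variable is a checkpoint candidate if it is read before any write in
    the iteration (upward-exposed) and is also written in the iteration.
    """
    written = set()          # variables already defined in this iteration
    upward_exposed = set()   # variables read before any write in this iteration
    for stmt in loop_body:
        for var in stmt.reads:
            if var not in written:
                upward_exposed.add(var)
        written.add(stmt.writes)
    return upward_exposed & written


# Hypothetical time-stepping loop body:
#   tmp  = f(x, dt)
#   x    = x + tmp
#   step = step + 1
body = [
    Stmt("tmp", ("x", "dt")),
    Stmt("x", ("x", "tmp")),
    Stmt("step", ("step",)),
]

critical = loop_carried_variables(body)
print(sorted(critical))  # ['step', 'x']
```

In this toy example `dt` is read but never modified, so it can be restored by rerunning initialization, and `tmp` is recomputed in every iteration; only `x` and `step` carry state across iterations. A real tool would have to work on actual execution traces or compiler IR and handle arrays, pointers, and I/O state, which this sketch ignores.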