Checkpoint/Restart (C/R) periodically saves the running state of a program, which consumes considerable system resources. We observe that in typical HPC applications not every piece of data is involved in the computation; such unused data should be excluded from checkpoints to improve storage and compute efficiency. To identify such data, we propose a systematic approach that leverages automatic differentiation (AD) to scrutinize every element of the variables to be checkpointed (e.g., arrays), allowing us to distinguish critical from uncritical elements and to eliminate the uncritical ones from checkpointing. Specifically, we use an AD tool to inspect each element of a checkpointed variable and determine whether it has an impact on the application output. We empirically validate our approach with eight benchmarks from the NAS Parallel Benchmarks (NPB) suite. We visualize the critical and uncritical elements and regions within a variable according to whether they affect the application output (yes or no). We find that the patterns and distributions of critical and uncritical regions are interesting in their own right and follow the physical formulation and logic of the algorithm. The evaluation on the NPB benchmarks shows that our approach reduces checkpoint storage by up to 20%.
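The element-wise test described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool or workflow: the `Dual` class, `toy_kernel`, and `critical_mask` are hypothetical names introduced here, and the kernel is a made-up stand-in for an HPC code whose first and last array cells are never read. The sketch uses forward-mode AD with dual numbers, seeds one element at a time, and marks the element as critical when the derivative of the output with respect to it is nonzero. A zero derivative at a single evaluation point is assumed here to mean "no impact", which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number: a value plus its derivative w.r.t. the seeded input."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a*b)' = a'*b + a*b'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def toy_kernel(u):
    """Hypothetical computation: only interior cells feed the output;
    u[0] and u[-1] act as padding that is never read."""
    total = Dual(0.0)
    for i in range(1, len(u) - 1):
        total = total + u[i] * u[i]
    return total


def critical_mask(kernel, state):
    """Seed each element in turn and record whether d(output)/d(element) != 0."""
    mask = []
    for k in range(len(state)):
        seeded = [Dual(v, 1.0 if i == k else 0.0) for i, v in enumerate(state)]
        mask.append(kernel(seeded).dot != 0.0)
    return mask


state = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mask = critical_mask(toy_kernel, state)
# Only elements flagged as critical are written to the (toy) checkpoint.
checkpoint = {i: v for i, v in enumerate(state) if mask[i]}
print(mask)        # [False, True, True, True, True, False]
print(checkpoint)  # {1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}
```

In this toy case two of the six elements are dropped from the checkpoint. Seeding one element per run costs one kernel evaluation per element; a reverse-mode AD tool would obtain all derivatives in a single pass, which is presumably the more practical choice for large arrays.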