We introduce a new algorithm for N-to-M checkpointing in finite element simulations. The algorithm efficiently saves and loads functions that represent physical quantities defined on the mesh of the physical domain. In particular, the number of parallel processes used for loading may differ from the number used for saving, so a simulation can be restarted or post-processed on a process count suited to the given phase of the simulation and to other conditions. For demonstration, we implemented the algorithm in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface to Firedrake, a system for solving partial differential equations using finite element methods. We evaluated the implementation by saving and loading data with 8.2 billion finite element degrees of freedom using 8,192 parallel processes on ARCHER2, the UK National Supercomputing Service.
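To make the N-to-M idea concrete, the following toy sketch saves a set of degrees of freedom from N simulated processes and reloads them onto M simulated processes. It is not the PETSc algorithm: it assumes every degree of freedom carries a unique global index, uses an in-memory dictionary in place of a file, and omits mesh topology and parallel I/O entirely. The names `partition`, `save`, and `load` are hypothetical and exist only for this illustration.

```python
def partition(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous, nearly equal chunks."""
    base, extra = divmod(n_items, n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def save(local_parts):
    """Merge per-process {global_index: value} maps into one shared store.

    The store stands in for a checkpoint file keyed by global index, so it
    no longer depends on how many processes wrote it.
    """
    store = {}
    for part in local_parts:
        store.update(part)
    return store


def load(store, n_procs):
    """Hand each of n_procs loading processes its own chunk of the store."""
    chunks = partition(len(store), n_procs)
    return [{g: store[g] for g in chunk} for chunk in chunks]


# Toy data: 10 degrees of freedom, saved from N = 3 and loaded onto M = 4.
n_dofs = 10
values = [float(g) ** 2 for g in range(n_dofs)]
saved_parts = [{g: values[g] for g in chunk} for chunk in partition(n_dofs, 3)]
store = save(saved_parts)
loaded_parts = load(store, 4)
```

The key design point in this sketch is that data is stored against a process-independent global index, which is what lets the loading side choose its own partition.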