We introduce a new algorithm for N-to-M checkpointing in finite element simulations. The algorithm efficiently saves and loads functions that represent physical quantities defined on the mesh of the physical domain. In particular, the number of parallel processes used for loading may differ from the number used for saving, so a simulation can be restarted or post-processed on the process count best suited to the given phase of the simulation and other conditions. To demonstrate the algorithm, we implemented it in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface to Firedrake, a system for solving partial differential equations using finite element methods. We evaluated the implementation by saving and loading data comprising 8.2 billion finite element degrees of freedom using 8,192 parallel processes on ARCHER2, the UK National Supercomputing Service.
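The N-to-M idea can be illustrated with a deliberately simplified sketch. This is not the algorithm implemented in PETSc or Firedrake: it assumes a one-dimensional array of degrees of freedom with contiguous block ownership, uses a plain Python list in place of a parallel file, and simulates processes with a loop. All function names (`block_partition`, `save`, `load`) are hypothetical. The point it shows is only that values stored by global degree-of-freedom index do not depend on the process count that wrote them, so a different process count can read them back. A real implementation must additionally handle mesh topology and unstructured partitions.

```python
"""Toy illustration of N-to-M checkpointing (not the PETSc/Firedrake algorithm)."""


def block_partition(num_dofs, num_procs):
    """Split global DoF indices 0..num_dofs-1 into contiguous per-process ranges."""
    base, extra = divmod(num_dofs, num_procs)
    ranges, start = [], 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def save(local_values_per_rank, ownership):
    """Each of the N simulated processes writes its values at their global DoF indices."""
    num_dofs = sum(len(owned) for owned in ownership)
    stored = [None] * num_dofs  # stands in for a shared file on disk (assumption)
    for owned, values in zip(ownership, local_values_per_rank):
        for global_index, value in zip(owned, values):
            stored[global_index] = value
    return stored


def load(stored, num_procs):
    """Each of the M simulated processes reads the slice of the global array it now owns."""
    ownership = block_partition(len(stored), num_procs)
    return [[stored[i] for i in owned] for owned in ownership]


# Save on N = 3 simulated processes, then load on M = 2.
num_dofs = 10
save_ownership = block_partition(num_dofs, 3)
saved_locals = [[float(i) ** 2 for i in owned] for owned in save_ownership]
checkpoint = save(saved_locals, save_ownership)
loaded_locals = load(checkpoint, 2)
```

Concatenating the per-process lists in `loaded_locals` recovers the same global vector that was saved, even though the data are now split across two owners instead of three.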