This paper examines checkpoint-restart (C/R) mechanisms in High-Performance Computing (HPC), focusing on the use of Distributed MultiThreaded CheckPointing (DMTCP) both inside and outside containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of C/R for managing complex, long-running computations in HPC, highlighting its efficiency and reliability in such environments. We then examine how DMTCP supports these workflows, particularly for multi-threaded and distributed applications. We also cover HPC container technologies such as Shifter and Podman-HPC, which aid in managing computational tasks and help ensure consistent performance across different environments. Finally, we present the methods, results, and potential future directions of this research, including its application in various scientific domains.