As HPC systems grow in size, faults become increasingly frequent. The issue is especially relevant because MPI, the de facto standard for inter-process communication, lacks proper fault management functionality. Past efforts produced extensions to the MPI standard that enable fault management, including ULFM. While ULFM provides powerful tools for handling faults, it still has limitations, such as the collective nature of its repair procedure. In this paper, we overcome those limitations and achieve fault-aware, non-collective communicator creation and repair. We integrate our solution into an existing fault resiliency framework and measure the overhead it introduces in the application code. The experimental campaign shows that our solution is scalable and introduces limited overhead, and that non-collective repair is a viable option for ULFM-based applications.