The latest version of the MPI standard introduces new functionality such as the Session model, but it still lacks fault management mechanisms. Past efforts produced tools and extensions to the MPI standard for handling faults, including User-Level Failure Mitigation (ULFM). These approaches are effective against faults but do not fully support the new additions to the standard. In this paper, we combine the fault management capabilities of ULFM with the Session model introduced in version 4.0 of the standard. We focus on the communicator creation procedure, highlighting its critical issues and proposing a method to circumvent them. Our experimental campaign shows that the proposed solution does not significantly affect application execution time or scalability, while handling the occurrence of faults more effectively.