Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem becomes more difficult when collective operations across processes must be supported, such as barrier, reduce, scatter, and gather. Some processes may have reached the barrier or other collective operation, while other processes take a long time to reach that same operation. At least two solutions are well known in the literature: (i) draining in-flight network messages and then freezing the network at checkpoint time; and (ii) adding a barrier prior to the collective operation, and either completing the operation or aborting the barrier if not all processes are present. Both solutions suffer from important drawbacks. The code in the first solution must be updated whenever it is ported to a newer network. The second solution implies additional barrier-related network traffic prior to each collective operation. This work presents a third solution that avoids both drawbacks: there is no additional barrier-related traffic, and the solution is implemented entirely above the network layer. The work is demonstrated in the context of transparent checkpointing of MPI libraries for parallel computation, where each of the first two solutions has already been used in prior systems and then abandoned due to the aforementioned drawbacks. Experiments demonstrate the low runtime overhead of this new, network-agnostic approach. The approach is also extended to non-blocking collective operations in order to handle overlapping of computation and communication.
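To make prior solution (ii) concrete, the following is a minimal single-machine sketch and not the implementation of any of the systems mentioned above. It assumes that threads stand in for MPI processes and that Python's `threading.Barrier` stands in for the extra barrier placed in front of the collective operation. The function name `run_guarded_collective` and its parameters are hypothetical, introduced only for illustration.

```python
import threading


def run_guarded_collective(n_ranks, arriving_ranks):
    """Toy model of a trial barrier placed in front of a collective operation.

    n_ranks        -- number of simulated processes the collective expects
    arriving_ranks -- ranks that actually reach the collective before a
                      checkpoint request arrives (hypothetical scenario input)

    Returns a dict mapping each arriving rank to "completed" or "aborted".
    """
    trial_barrier = threading.Barrier(n_ranks)
    outcome = {}
    outcome_lock = threading.Lock()

    def rank_main(rank):
        try:
            # Extra synchronization before the real collective. In a real
            # system this is the additional barrier-related network traffic.
            trial_barrier.wait()
            result = "completed"  # all ranks present: run the collective
        except threading.BrokenBarrierError:
            result = "aborted"    # outside the collective: safe to checkpoint
        with outcome_lock:
            outcome[rank] = result

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in arriving_ranks]
    for t in threads:
        t.start()

    if len(arriving_ranks) < n_ranks:
        # A checkpoint is requested while some ranks are still missing.
        # Aborting releases the waiting ranks (and any rank arriving later)
        # with BrokenBarrierError instead of letting them block.
        trial_barrier.abort()

    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(run_guarded_collective(3, [0, 1, 2]))
    print(run_guarded_collective(3, [0, 1]))
```

In this sketch every collective pays for one extra synchronization even when no checkpoint is ever requested, which is the drawback the abstract attributes to solution (ii).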