Despite the many shared-memory programming frameworks introduced over the years, OpenMP remains the most popular. However, the high degree of non-determinism in its execution makes debugging and testing more challenging. The ability to record and deterministically replay a program's execution is key to addressing this challenge, yet replaying OpenMP programs scalably is still an unresolved problem. In this paper, we propose two novel techniques that use Distributed Clock (DC) and Distributed Epoch (DE) recording schemes to eliminate excessive thread synchronization in OpenMP record and replay. Our evaluation on representative HPC applications with ReOMP, the tool in which we implemented DC and DE recording, shows that our approach is 2-5x more efficient than traditional approaches that synchronize on every shared-memory access. Furthermore, we demonstrate that our approach can be easily combined with MPI-level replay tools to replay non-trivial MPI+OpenMP applications. We achieve this by integrating ReOMP into ReMPI, an existing scalable MPI record-and-replay tool, with only a small runtime overhead that is independent of MPI scale.