Understanding the performance bottlenecks that arise from the intricate hardware-software interactions of highly parallel programs on HPC clusters is crucial. This paper sheds light on automatic asynchronous MPI communication in memory-bound parallel programs on multicore clusters and on how it can be facilitated. For instance, deliberately injecting delays to slow down MPI processes can improve performance if certain conditions are met. This leads to the counterintuitive conclusion that noise, regardless of its source, is not always detrimental and can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics; they help spot patterns in parallel execution that easily go unnoticed with traditional tracing tools. We investigate five microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications.