The transition of scientific applications to GPU-accelerated exascale systems is constrained by trade-offs between performance, portability, and productivity. This work evaluates the performance portability of directive-based GPU programming by porting gPLUTO, a production-grade magnetohydrodynamics code for astrophysical simulations, from OpenACC to OpenMP, and analyzing its performance on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) devices. On NVIDIA platforms, OpenACC and OpenMP achieve comparable performance due to a shared compiler backend, providing a consistent baseline for assessing algorithmic efficiency. In contrast, the same OpenMP implementation runs approximately three times slower at the application level on the AMD MI250X than the OpenACC baseline on the NVIDIA A100, with kernel-level slowdowns reaching up to an order of magnitude, driven by sensitivity to strided memory-access patterns and by compiler limitations. Kernel-level profiling shows that the dominant contributors to runtime are bound by memory latency rather than by peak bandwidth. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, leading to extreme slowdowns of up to 47x in specific cases. These results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.
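To make the OpenACC-to-OpenMP port concrete, the following is a minimal sketch of how one loop nest might carry either directive set. It is not taken from gPLUTO: the function `scale_field` and the build macros `USE_OPENACC` and `USE_OPENMP_TARGET` are hypothetical names chosen for illustration. Without an offloading compiler the pragmas are skipped and the loop runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from gPLUTO): scale a 2D field stored row-major.
// USE_OPENACC and USE_OPENMP_TARGET are illustrative build-time macros.
std::vector<double> scale_field(const std::vector<double>& in,
                                std::size_t nx, std::size_t ny, double factor)
{
    assert(in.size() == nx * ny);
    std::vector<double> out(nx * ny, 0.0);
    const std::size_t n = nx * ny;
    if (n == 0) return out;  // nothing to offload

    const double* a = in.data();
    double* b = out.data();

#if defined(USE_OPENACC)
    // OpenACC form: data movement expressed with copyin/copyout clauses.
    #pragma acc parallel loop collapse(2) copyin(a[0:n]) copyout(b[0:n])
#elif defined(USE_OPENMP_TARGET)
    // OpenMP form: the same intent expressed with map(to/from) clauses.
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:n]) map(from: b[0:n])
#endif
    for (std::size_t j = 0; j < ny; ++j) {
        for (std::size_t i = 0; i < nx; ++i) {
            // i is the fastest-varying index, so consecutive iterations
            // touch consecutive addresses (unit stride).
            b[j * nx + i] = factor * a[j * nx + i];
        }
    }
    return out;
}
```

Swapping the index expression to `i * ny + j` while keeping the same loop order would give a strided access pattern, the kind of pattern the abstract identifies as a source of kernel-level slowdowns on the MI250X.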