There is growing interest in using standard language constructs for accelerated computing, avoiding the need for (often vendor-specific) external APIs. These constructs hold the potential to be more portable and much more `future-proof'. For Fortran codes, the current focus is on the {\tt do concurrent} (DC) loop. While there have been some successful examples of GPU acceleration using DC for benchmarks and/or small codes, its widespread adoption will require demonstrations of its use in full-size applications. Here, we look at the current capabilities and performance of using DC in a production application called Magnetohydrodynamic Algorithm outside a Sphere (MAS). MAS is a state-of-the-art model for studying coronal and heliospheric dynamics, is over 70,000 lines long, and has previously been ported to GPUs using MPI+OpenACC. We attempt to eliminate as many of its OpenACC directives as possible in favor of DC. We show that using the NVIDIA {\tt nvfortran} compiler's Fortran 202X preview implementation, unified managed memory, and modified MPI launch methods, we can achieve GPU acceleration across multiple GPUs without using a single OpenACC directive. However, doing so results in a slowdown between 1.25x and 3x. We discuss what future improvements are needed to avoid this loss, and show how we can still retain close to the original performance.