Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
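The abstract refers to directly programming the FP64 tensor cores. The public CUDA interface for this is the double-precision WMMA fragment API (compute capability 8.0 and above), whose only supported fragment shape for `double` is 8x8x4. The following is a minimal sketch of that mechanism only, not code from the paper; the kernel name and the assumption of pre-packed 8x4 and 4x8 operand tiles are illustrative:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes an 8x8 double-precision tile C = A*B via the FP64
// tensor core (DMMA) path. A is an 8x4 row-major tile, B a 4x8 tile
// stored column-major; both leading dimensions equal k = 4.
// Hypothetical kernel name; requires sm_80 or newer.
__global__ void dmma_tile_8x8x4(const double *A, const double *B, double *C) {
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> c;

    wmma::fill_fragment(c, 0.0);           // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 4);       // leading dimension of A is 4
    wmma::load_matrix_sync(b, B, 4);       // leading dimension of B is 4
    wmma::mma_sync(c, a, b, c);            // C += A * B on the tensor cores
    wmma::store_matrix_sync(C, c, 8, wmma::mem_row_major);
}
```

In a high-order finite element setting, small dense tiles of this kind arise naturally from per-element interpolation and quadrature operators, which is what makes the tensor core path applicable to such simulations.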