In this paper, we explore the acceleration of tensor product operations in finite element methods using the Tensor Cores of the NVIDIA A100 GPU. We provide an accessible overview of the necessary mathematical background and discuss our implementation strategies. Our study focuses on two common programming approaches for NVIDIA Tensor Cores: the C++ Warp Matrix Functions in nvcuda::wmma and the inline Parallel Thread Execution (PTX) instructions mma.sync.aligned. We focus in particular on the more flexible inline PTX instructions combined with a conflict-free shared memory access pattern, which is key to achieving high performance. When benchmarked against traditional CUDA Cores, our approach yields a 2.3-fold increase in double-precision performance, reaching 8 TFLOP/s, or 45% of the theoretical maximum. Furthermore, in half-precision computations, numerical experiments demonstrate a fourfold speedup in solving the 3D Poisson equation with the flexible GMRES (FGMRES) method preconditioned by multigrid. This is achieved while maintaining the same discretization error as observed in double-precision computations. These results highlight the considerable benefits of using Tensor Cores for finite element operators with tensor product structure, offering a favorable balance between computational speed and precision.
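To illustrate why tensor product operators suit matrix-multiply hardware, the following minimal sketch applies a 2D operator of the form A ⊗ B to a vector in two ways: directly, entry by entry, and by rewriting it as two small dense matrix products (often called sum factorization). It is plain Python for exposition only, not GPU code and not the implementation described in the paper. The function names and the row-major layout of the vector are assumptions made for this example.

```python
# Illustrative sketch only: function names and data layout are hypothetical
# and are not taken from the paper.

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]


def kron_apply(A, B, x):
    """Reference: apply (A kron B) to x entry by entry, O(n^4) work.

    x is interpreted as an n-by-n array X in row-major order,
    i.e. x[k * n + l] == X[k][l].
    """
    n = len(A)
    y = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    y[i * n + j] += A[i][k] * B[j][l] * x[k * n + l]
    return y


def sum_factorized_apply(A, B, x):
    """Same result via two small matrix products, Y = A X B^T, O(n^3) work."""
    n = len(A)
    X = [x[i * n:(i + 1) * n] for i in range(n)]   # reshape x to n-by-n
    Y = matmul(matmul(A, X), transpose(B))
    return [v for row in Y for v in row]           # flatten back to a vector


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    x = [1, 2, 3, 4]
    print(kron_apply(A, B, x))
    print(sum_factorized_apply(A, B, x))
```

Both functions return the same vector; the second expresses the work as small dense matrix products, which is the kind of operation that matrix-multiply-accumulate units such as Tensor Cores execute.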