We present a GPU implementation of vertex-patch smoothers for higher-order finite element methods in two and three dimensions. Analysis shows that these smoothers are memory bound not with respect to GPU DRAM, but with respect to on-chip scratchpad memory. We optimize the multigrid operations by localizing them and reorganizing the local operations in on-chip memory, which minimizes global data transfer and yields a conflict-free memory access pattern. Performance tests on the Poisson problem show that the optimized kernel is at least twice as fast as the straightforward implementation across various polynomial degrees in 2D and 3D, and that it reaches up to 36% of peak performance in both single and double precision on an NVIDIA A100 GPU.
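
To illustrate what a conflict-free access pattern in on-chip memory means, the following minimal sketch counts bank conflicts for a column read from a row-major tile. It is not the method of this work, which the abstract does not detail. It assumes the commonly documented layout of 32 banks with 4-byte words, and it uses row padding, a generic textbook remedy, as a stand-in technique. The names `conflict_degree` and `column_access` are hypothetical helpers introduced only for this example.

```python
NUM_BANKS = 32  # assumption: 32 banks of 4-byte words, as on recent NVIDIA GPUs


def conflict_degree(word_addresses):
    """Largest number of distinct words that fall into the same bank.

    A value of 1 means the access is conflict free; a value of k means
    the hardware would serialize the request into k passes.
    """
    per_bank = {}
    # Threads reading the same word are served by a broadcast, so
    # duplicates are removed before counting.
    for addr in set(word_addresses):
        bank = addr % NUM_BANKS
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())


def column_access(row_stride, column=0, threads=32):
    """Word addresses when thread t reads element (t, column) of a row-major tile."""
    return [t * row_stride + column for t in range(threads)]


# A 32-word row stride sends every thread of the warp to the same bank.
unpadded = conflict_degree(column_access(row_stride=32))
# Padding each row by one word shifts successive rows to successive banks.
padded = conflict_degree(column_access(row_stride=33))

print(unpadded, padded)
```

Under these assumptions the unpadded column read is serialized 32 ways, while the padded layout is conflict free. The model covers 4-byte words only; 8-byte (double precision) accesses are mapped to banks differently and would need a separate analysis.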