Accelerated computing is widely used in high-performance computing, so it is crucial to explore how to make better use of the latest GPGPU generations for relevant applications. In this paper, we present results and share insights on highly tuned stencil-based kernels for the NVIDIA Ampere (A100) and Hopper (GH200) architectures. The performance results yield useful insights into the behavior of this type of algorithm on these new accelerators, knowledge that can be leveraged by the many scientific applications that involve stencil computations. We also evaluate three programming models on these accelerators: CUDA, OpenACC, and OpenMP target offloading. We extensively study the performance and portability of various kernels under each programming model, provide corresponding optimization recommendations, and compare the performance of the programming models across the two architectures. Relative to the previous GPGPU architecture generation, we achieve up to 58% performance improvement for a highly optimized kernel of the same class, and up to 42% for all classes. In terms of programming models, and keeping portability in mind, the optimized OpenACC implementation outperforms the OpenMP implementation by 33%. If portability is not a factor, our best-tuned CUDA implementation outperforms the optimized OpenACC one by 2.1x.