Accelerated computing is widely used in high-performance computing. It is therefore crucial to experiment with the latest generations of GPGPUs and discover how to utilize them better in relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for the NVIDIA Ampere (A100) and Hopper (GH200) architectures. The performance results yield useful insights into the behavior of this class of algorithms on these new accelerators; this knowledge can be leveraged by the many scientific applications that involve stencil computations. Further, we evaluate three programming models, CUDA, OpenACC, and OpenMP target offloading, on the aforementioned accelerators. We extensively study the performance and portability of various kernels under each programming model and provide corresponding optimization recommendations. Furthermore, we compare the performance of the different programming models on the mentioned architectures. We achieve up to 58% performance improvement over the previous GPGPU architecture generation for a highly optimized kernel of the same class, and up to 42% across all classes. In terms of programming models, and with portability in mind, our optimized OpenACC implementation outperforms the OpenMP implementation by 33%. If portability is not a factor, our best-tuned CUDA implementation outperforms the optimized OpenACC one by 2.1x.