Ensuring high productivity in scientific software development requires developing and maintaining a single codebase that runs efficiently on a range of accelerator-based supercomputing platforms. While prior work has investigated the performance portability of a few selected proxy applications or programming models, this paper provides a comprehensive study of a range of proxy applications implemented in the major programming models suitable for GPU-based platforms. We present and analyze performance results across NVIDIA and AMD GPU hardware currently deployed in leadership-class computing facilities, using a representative range of scientific codes and several programming models: CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL. Based on the specific characteristics of the applications tested, we include recommendations to developers on how to choose the right programming model for their code. Empirically, we find that Kokkos, RAJA, and SYCL in particular show the most promise as performance-portable programming models. These results provide a comprehensive evaluation of the extent to which each programming model for heterogeneous systems provides true performance portability in real-world usage.