The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches. To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
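The abstract reports two metrics without defining them. As a minimal sketch, assuming they follow the definitions commonly used in the performance-portability literature (performance portability as the harmonic mean of an application's efficiency over a set of platforms, and zero if any platform is unsupported; code divergence as the mean pairwise Jaccard distance between the source lines used for each platform), they could be computed as shown here. The function names, platform keys, efficiencies, and line sets are all hypothetical illustrations and are not measurements from the paper.

```python
from itertools import combinations


def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies (each in (0, 1]).

    Returns 0.0 if any platform is unsupported (None or non-positive),
    following the assumed definition of the metric.
    """
    values = list(efficiencies.values())
    if not values or any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


def code_divergence(sources):
    """Mean pairwise Jaccard distance between per-platform source line sets.

    0.0 means every platform compiles exactly the same lines;
    1.0 means no lines are shared between any two platforms.
    """
    pairs = list(combinations(sources.values(), 2))
    if not pairs:
        return 0.0

    def distance(a, b):
        union = a | b
        return 0.0 if not union else 1.0 - len(a & b) / len(union)

    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical efficiencies (fraction of best-known performance per platform).
# These are illustrative values only, NOT results from the paper.
example_efficiencies = {"amd": 0.8, "intel": 1.0, "nvidia": 0.5}
pp = performance_portability(example_efficiencies)

# Hypothetical sets of source-line identifiers compiled for each platform.
example_sources = {
    "amd": {"a", "b", "c"},
    "intel": {"a", "b", "c"},
    "nvidia": {"a", "b", "d"},
}
cd = code_divergence(example_sources)

print(f"performance portability = {pp:.3f}, code divergence = {cd:.3f}")
```

Because the harmonic mean is dominated by the weakest platform, a value close to 1 requires high efficiency on every target, while a code divergence close to 0 indicates that nearly all source lines are shared across targets.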