In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, using the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, the NVIDIA A100 GPU, and the AMD MI250X GPU. CPU support is currently less established: DPC++ supports only x86 CPUs, through OpenCL, whereas OpenSYCL has an OpenMP backend capable of targeting all modern CPUs. We benchmark the Intel Xeon Platinum 8360Y Processor (Ice Lake), the AMD EPYC 9V33X (Genoa-X), and the Ampere Altra platforms. We study a range of primarily bandwidth-bound applications implemented using the OPS and OP2 DSLs, evaluate different formulations in SYCL, and contrast their performance with "native" programming approaches where available (CUDA/HIP/OpenMP). On GPU architectures, SYCL on average even slightly outperforms native approaches, while on CPUs it falls behind, highlighting a continued need for improving CPU performance. While SYCL does not solve all the challenges of performance portability (e.g. needing different algorithms on different hardware), it does provide a single programming model and ecosystem to target most current HPC architectures productively.