Open SYCL on heterogeneous GPU systems: A case study

Computational platforms for high-performance scientific applications are becoming more heterogeneous, often including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. However, heterogeneous clusters and machines currently contain GPUs from different vendors, architectures and families. Programming with vendor-provided languages or frameworks, and optimizing for specific devices, can become cumbersome and compromise portability to other systems. To overcome this problem, several proposals for high-level heterogeneous programming have appeared, aiming to reduce the development effort and increase functional and performance portability, specifically when using GPU hardware accelerators. This paper evaluates the SYCL programming model, using the Open SYCL compiler, from two perspectives: the performance it offers when dealing with single or multiple GPU devices from the same or different vendors, and the development effort required to implement the code. As a case study, we use the Finite Time Lyapunov Exponent calculation over two real-world scenarios, and compare the performance and development effort of its Open SYCL-based version against equivalent versions that use CUDA or HIP. Based on the experimental results, we observe that the use of SYCL does not lead to a remarkable overhead in terms of GPU kernel execution time. In general terms, the Open SYCL development effort for the host code is lower than that observed with CUDA or HIP. Moreover, the SYCL version can take advantage of both CUDA and AMD GPU devices simultaneously much more easily than when directly using the vendor-specific programming solutions.
