GPU First -- Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, realizing the full potential of GPUs when accelerating legacy CPU applications is difficult for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it hard for non-experts to use GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are resolved either through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. We evaluate it on two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks that originally featured GPU-only parallelism, and three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism; the results showcase the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of the corresponding manually offloaded kernels, with up to 14.36x speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
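To make the contrast between an existing CPU parallel loop and a "manually offloaded kernel" concrete, the sketch below shows the same loop written both ways using standard OpenMP directives. It is only an illustration of the manual porting step the abstract refers to: the function names (`saxpy_cpu`, `saxpy_offload`, `saxpy_versions_agree`) and the SAXPY kernel are hypothetical and not taken from the paper, and the sketch does not show the GPU First compilation scheme itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical legacy CPU kernel: a host-side OpenMP parallel loop. */
void saxpy_cpu(size_t n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical manually offloaded variant: the developer adds a target
 * region and explicit map clauses to move data between the distinct
 * host and device memories. Without offloading support the pragma is
 * ignored or falls back to the host, and the loop still runs correctly. */
void saxpy_offload(size_t n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Returns 1 if both variants produce identical results on a small input.
 * The inputs are small integers, so the floating-point results are exact. */
int saxpy_versions_agree(void) {
    enum { N = 8 };
    float x[N], y_cpu[N], y_gpu[N];
    for (size_t i = 0; i < N; ++i) {
        x[i] = (float)i;
        y_cpu[i] = 1.0f;
        y_gpu[i] = 1.0f;
    }
    saxpy_cpu(N, 2.0f, x, y_cpu);
    saxpy_offload(N, 2.0f, x, y_gpu);
    for (size_t i = 0; i < N; ++i)
        if (y_cpu[i] != y_gpu[i])
            return 0;
    return 1;
}
```

Even in this small case, the offloaded variant requires the developer to choose the region to offload and to describe the data movement by hand, which is the kind of effort the abstract describes as a barrier for large legacy applications.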
