GPU First -- Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, exploiting the full potential of GPUs to accelerate legacy CPU applications is difficult for developers. Porting requires identifying code regions amenable to acceleration, managing distinct host and device memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it hard for non-experts to use GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are resolved either through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. We evaluate it on two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks that originally had GPU-only parallelism, and three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism; the evaluation showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36x speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
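To make the kind of input concrete, the following is a minimal sketch of legacy CPU code that such a scheme would take unmodified. It is an illustration only and is not taken from the evaluated applications: the function names `sum_of_squares` and `format_total` are hypothetical. It combines an existing OpenMP CPU-parallel loop with ordinary libc calls (`malloc`, `free`, `snprintf`, `assert`). The abstract does not state which of these calls would be served by the partial libc GPU implementation and which by a generated remote procedure call to the host, so the comments make no claim about that.

```c
/* Hypothetical legacy CPU code (illustrative names, not from the paper). */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* An existing OpenMP CPU-parallel loop: the kind of region the abstract
 * compares against manually offloaded kernels. */
double sum_of_squares(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Ordinary libc usage around the loop. When the whole program is compiled
 * for the GPU, each of these calls has to be resolved somehow; how a given
 * call is resolved is an implementation detail not given in the abstract. */
int format_total(char *buf, size_t len, int n) {
    assert(n > 0);                              /* libc: assert */
    double *x = malloc((size_t)n * sizeof *x);  /* libc: malloc */
    if (x == NULL)
        return -1;
    for (int i = 0; i < n; ++i)
        x[i] = 1.0;
    double total = sum_of_squares(x, n);
    free(x);                                    /* libc: free */
    /* libc: snprintf; returns the number of characters written. */
    return snprintf(buf, len, "total=%.1f", total);
}
```

The point of the sketch is that the source stays as it is: no offload directives, data-mapping clauses, or device-specific library replacements are added by hand.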
