Task-parallelism in SWIFT for heterogeneous compute architectures

This paper reports first steps towards graphics processing unit (GPU) acceleration of the task-parallel smoothed particle hydrodynamics (SPH) solver SWIFT. Novel combinations of algorithms are presented that enable SWIFT to function as truly heterogeneous software, leveraging task-parallelism on CPUs for memory-bound computations concurrently with GPUs for compute-bound computations, while minimising the effects of CPU-GPU communication latency. The proposed algorithms are validated through extensive testing. The GPU acceleration methodology is shown to deliver speedups of up to 3.5x and 7.5x for the offloaded computations when including and excluding, respectively, the time required to prepare and post-process data transfers on the CPU side. For a full simulation on a single Grace-Hopper superchip, the GPU-accelerated hydrodynamic solver is overall 1.8 times faster than the same simulation run on the superchip's fully parallelised CPU. This corresponds to an improvement from 8 million particle updates/s for the full CPU-only baseline (115,000 updates/s per CPU core) to 15 million updates/s for the GPU-accelerated SPH solver. Moreover, the solver displays near-perfect strong scaling on 4 Grace-Hopper nodes. The GPU acceleration is also demonstrated to give a 29 percent improvement in energy efficiency in comparison to CPU-only baselines. Finally, interdependent bottlenecks in the prototype solver presented in this work are identified: a significant fraction (up to 80 percent) of a GPU-offloading cycle is spent on the CPU, preparing particle data for the transfer to the GPU and post-processing the data transferred back. Approaches to minimise the effects of these bottlenecks and maximise the solver's performance are suggested for future work.
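As a rough consistency check on the figures quoted above, the following minimal sketch computes the throughput ratio implied by the rounded update rates, together with a simple Amdahl-style bound for the offloading cycle. The function names are hypothetical and are not part of SWIFT. The bound assumes that the CPU-side preparation and post-processing do not overlap with the rest of the cycle, which is a simplification made here for illustration and is not stated in the source.

```python
def throughput_ratio(baseline_updates_per_s, accelerated_updates_per_s):
    """Ratio of accelerated to baseline particle-update throughput."""
    return accelerated_updates_per_s / baseline_updates_per_s


def cycle_speedup_bound(cpu_side_fraction):
    """Amdahl-style upper bound on the speedup of one offloading cycle.

    Assumes (hypothetically) that everything except the CPU-side
    preparation and post-processing could be made infinitely fast,
    and that the CPU-side work is not overlapped with anything else.
    """
    if not 0.0 < cpu_side_fraction <= 1.0:
        raise ValueError("cpu_side_fraction must lie in (0, 1]")
    return 1.0 / cpu_side_fraction


# Rounded figures from the abstract: 8 and 15 million updates/s.
ratio = throughput_ratio(8e6, 15e6)

# Up to 80 percent of an offloading cycle is CPU-side preparation
# and post-processing, according to the abstract.
bound = cycle_speedup_bound(0.80)

print(f"throughput ratio from rounded figures: {ratio:.3f}")
print(f"cycle speedup bound at 80% CPU-side work: {bound:.2f}")
```

Because the quoted update rates are rounded, the ratio agrees only approximately with the reported factor of 1.8. Under the stated assumption, the bound suggests why the CPU-side data handling, rather than the GPU kernels themselves, is the natural target for further optimisation.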
