The PLUTO Code on GPUs: Offloading Lagrangian Particle Methods

From arXiv. Published in Astronomy and Computing, special issue: Advancing Cosmology and Astrophysics through High-Performance Computing and Machine Learning

The Lagrangian Particles (LP) module of the PLUTO code is a simulation tool for predicting the non-thermal emission produced by shock-accelerated particles in large-scale, relativistic, magnetized astrophysical flows. Each LP represents an ensemble of relativistic particles with a given energy distribution, which is updated by solving the relativistic cosmic-ray transport equation. The approach consistently includes the effects of adiabatic expansion, synchrotron emission and inverse Compton emission. The large-scale nature of such systems creates a very high computational demand that can only be met by targeting modern computing hardware such as Graphics Processing Units (GPUs). In this work we present the GPU-compatible C++ redesign of the LP module which, by means of the OpenACC programming model and the Message Passing Interface (MPI) library, can target both single commercial GPUs and multi-node (pre-)exascale computing facilities. The code has been benchmarked on up to 28672 parallel CPU cores and 1024 parallel GPUs, demonstrating $\sim(80-90)\%$ weak-scaling parallel efficiency and good strong-scaling capabilities. Our results demonstrate a $6\times$ speedup when solving the same benchmark test on 128 full GPU nodes (4 GPUs per node) compared with the same number of full high-end CPU nodes (112 cores per node). Furthermore, we verified the code by comparing its predictions with the corresponding analytical solutions for two test cases. We note that this work is part of a broader project that aims at developing gPLUTO, the new and revised GPU-ready implementation of the legacy PLUTO code.
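To make the scaling figures concrete, the following Python sketch works through the node arithmetic quoted in the abstract and shows how weak- and strong-scaling efficiencies are commonly computed. The efficiency formulas are the usual textbook definitions and are an assumption here, since the abstract does not state which definitions the paper uses. The function names and the wall-clock timings are hypothetical and serve only as an illustration; they are not measurements from the paper.

```python
# Node configuration quoted in the abstract.
NODES = 128
GPUS_PER_NODE = 4
CORES_PER_NODE = 112
NODE_SPEEDUP = 6.0  # GPU nodes versus the same number of CPU nodes


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak scaling: work per device is fixed, so the ideal wall time is constant.

    Assumed definition: efficiency = T(base) / T(scaled).
    """
    return t_base / t_scaled


def strong_scaling_efficiency(
    t_base: float, n_base: int, t_scaled: float, n_scaled: int
) -> float:
    """Strong scaling: total work is fixed, so the ideal wall time falls as n_base / n_scaled.

    Assumed definition: efficiency = (T(base) * n_base) / (T(scaled) * n_scaled).
    """
    return (t_base * n_base) / (t_scaled * n_scaled)


# Device counts implied by the 128-node comparison in the abstract.
total_gpus = NODES * GPUS_PER_NODE    # 128 * 4 GPUs
total_cores = NODES * CORES_PER_NODE  # 128 * 112 CPU cores

# A 6x node-to-node speedup means one GPU does the work of this many CPU cores.
cores_per_gpu_equivalent = NODE_SPEEDUP * CORES_PER_NODE / GPUS_PER_NODE

# Hypothetical timings in seconds, NOT taken from the paper.
weak_eff = weak_scaling_efficiency(t_base=10.0, t_scaled=12.0)
strong_eff = strong_scaling_efficiency(t_base=100.0, n_base=1, t_scaled=30.0, n_scaled=4)

if __name__ == "__main__":
    print(f"GPUs in the 128-node run:      {total_gpus}")
    print(f"CPU cores in the 128-node run: {total_cores}")
    print(f"CPU cores matched by one GPU:  {cores_per_gpu_equivalent:.0f}")
    print(f"Weak-scaling efficiency (hypothetical):   {weak_eff:.2%}")
    print(f"Strong-scaling efficiency (hypothetical): {strong_eff:.2%}")
```

Under these numbers the 128-node comparison pits 512 GPUs against 14336 CPU cores, so the quoted node-level speedup corresponds to one GPU matching roughly 168 cores. The largest benchmarks quoted (1024 GPUs and 28672 cores) are exactly twice these counts.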
