Multiple-GPU accelerated high-order gas-kinetic scheme on three-dimensional unstructured meshes

High-order gas-kinetic schemes (HGKS) have recently been applied successfully to compressible flows on unstructured meshes. In this paper, to accelerate the computation, HGKS is implemented on graphics processing units (GPUs) using the compute unified device architecture (CUDA). HGKS on unstructured meshes is a fully explicit scheme, so the acceleration framework can be built on cell-level parallelism. For single-GPU computation, the connectivity of the geometric information is generated to meet the requirements of data locality and independence. Based on this data structure, the CUDA kernels and their corresponding grids are set up. With a one-to-one mapping between cell indices and CUDA threads, HGKS can be implemented for single-GPU computation. For multiple-GPU computation, domain decomposition and data exchange must also be taken into account. The domain is decomposed into subdomains by METIS, and MPI processes are created to control each GPU and to handle the communication among GPUs. By reconstructing the connectivity and adding ghost cells, each GPU can inherit the main CUDA configuration used for a single GPU. Benchmark cases for compressible flows, including an accuracy test and the flow past a sphere, are presented to assess the numerical performance of HGKS on Nvidia RTX A5000 and Tesla V100 GPUs. For single-GPU computation, compared with the parallel central processing unit (CPU) code running on an Intel Xeon Gold 5120 CPU with open multi-processing (OpenMP) directives, a 5x speedup is achieved by the RTX A5000 and a 9x speedup by the Tesla V100. For multiple-GPU computation, the HGKS code scales well as the number of GPUs increases. The numerical results confirm the excellent performance of the multiple-GPU accelerated HGKS on unstructured meshes.
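As a rough illustration of the ideas summarized above (flat connectivity, one thread per cell, and ghost cells after partitioning), the following minimal sketch uses plain Python rather than CUDA, with an explicit index `tid` standing in for the CUDA thread index. All names (`launch_config`, `cell_update`, `build_local_mesh`, `NO_NEIGHBOR`), the flat neighbor layout, and the averaging stencil are assumptions made for illustration only; they are not the data structures or the flux update of the actual HGKS code, and the partition array `part` is assumed to come from a partitioner such as METIS.

```python
import math

NO_NEIGHBOR = -1  # assumed marker for a boundary face without a neighbor cell


def launch_config(num_cells, block_size=256):
    """Return (blocks, threads per block) so that every cell gets one thread."""
    return math.ceil(num_cells / block_size), block_size


def cell_update(tid, num_cells, neighbors, faces_per_cell, u_old, u_new):
    """Body of a hypothetical per-cell kernel.

    tid plays the role of blockIdx.x * blockDim.x + threadIdx.x.
    Each thread reads u_old and writes only u_new[tid], so cells are independent.
    """
    if tid >= num_cells:  # surplus threads in the last block do nothing
        return
    total, count = u_old[tid], 1
    for f in range(faces_per_cell):
        nb = neighbors[tid * faces_per_cell + f]  # flat cell-to-cell connectivity
        if nb != NO_NEIGHBOR:
            total += u_old[nb]
            count += 1
    u_new[tid] = total / count  # placeholder stencil, not the HGKS update


def build_local_mesh(part, rank, neighbors, faces_per_cell):
    """Renumber the cells owned by `rank` and append ghost cells.

    part[c] is the subdomain of global cell c (assumed partitioner output).
    Owned cells get local indices 0..n_owned-1; ghost cells follow them.
    """
    owned = [c for c, p in enumerate(part) if p == rank]
    local_of = {c: i for i, c in enumerate(owned)}
    ghosts = []
    local_neighbors = []
    for c in owned:
        for f in range(faces_per_cell):
            nb = neighbors[c * faces_per_cell + f]
            if nb == NO_NEIGHBOR:
                local_neighbors.append(NO_NEIGHBOR)
                continue
            if nb not in local_of:  # neighbor lives on another subdomain
                local_of[nb] = len(owned) + len(ghosts)
                ghosts.append(nb)
            local_neighbors.append(local_of[nb])
    return owned, ghosts, local_neighbors
```

In this sketch the per-cell kernel body is unchanged after partitioning: it runs over the owned cells only, while the ghost entries at the end of the solution array would be filled by data exchange (for example via MPI) before each update.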
