Multi-GPU Acceleration of the PALABOS Fluid Solver Using C++ Standard Parallelism

This article presents the principles, software architecture, and performance analysis of the GPU port of the lattice Boltzmann software library Palabos (J. Latt et al., "Palabos: Parallel lattice Boltzmann solver", Comput. Math. Appl. 81, 334-350 (2021)). A hybrid CPU-GPU execution model is adopted, in which numerical components are selectively assigned to either the CPU or the GPU, depending on considerations of performance or convenience. This design enables a progressive porting strategy, allowing most features of the original CPU-based codebase to be adapted to GPU execution gradually and seamlessly. The new architecture builds upon two complementary paradigms: a classical object-oriented structure for CPU execution, and a data-oriented counterpart for GPUs, which reproduces the modularity of the original code while eliminating the object-oriented overhead that is detrimental to GPU performance. Central to this approach is the use of modern C++, including standard parallel algorithms and template metaprogramming techniques, which permit the generation of hardware-agnostic computational kernels. This facilitates the development of user-defined, GPU-accelerated components such as collision operators or boundary conditions, while preserving compatibility with the existing codebase and avoiding the need for external libraries or non-standard language extensions. The correctness and performance of the GPU-enabled Palabos are demonstrated through a series of three-dimensional multiphysics benchmarks, including the laminar-turbulent transition in a Taylor-Green vortex, lid-driven cavity flow, and pore-scale flow in Berea sandstone. Despite the high level of abstraction of the implementation, single-GPU performance is comparable to that of CUDA-native solvers, and multi-GPU tests exhibit good weak and strong scaling across all test cases.
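
To make the combination of standard parallel algorithms, a data-oriented layout, and template-based components more concrete, the following is a minimal sketch and not code from Palabos. It assumes a two-dimensional D2Q9 lattice with a BGK collision for brevity (the benchmarks above are three-dimensional), and every name in it (`Cell`, `BGK`, `collide`, `equilibrium`) is hypothetical. Populations are stored in a single structure-of-arrays buffer, the collision operator is a plain struct passed as a template parameter rather than through a virtual interface, and the loop over cells is a `std::for_each` with an execution policy.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Standard D2Q9 lattice constants (velocities and weights).
constexpr int Q = 9;
constexpr std::array<int, Q> cx = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr std::array<int, Q> cy = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr std::array<double, Q> w = {4.0 / 9,  1.0 / 9,  1.0 / 9,
                                     1.0 / 9,  1.0 / 9,  1.0 / 36,
                                     1.0 / 36, 1.0 / 36, 1.0 / 36};

// Populations of one cell, held in local storage during collision.
using Cell = std::array<double, Q>;

// Second-order equilibrium distribution for direction k.
inline double equilibrium(int k, double rho, double ux, double uy) {
    const double cu = cx[k] * ux + cy[k] * uy;
    const double uu = ux * ux + uy * uy;
    return w[k] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
}

// Hypothetical user-defined collision operator: a plain struct with no
// virtual functions, so it can be copied by value into the kernel.
struct BGK {
    double omega;  // relaxation frequency

    void operator()(Cell& f) const {
        double rho = 0.0, jx = 0.0, jy = 0.0;
        for (int k = 0; k < Q; ++k) {
            rho += f[k];
            jx += cx[k] * f[k];
            jy += cy[k] * f[k];
        }
        const double ux = jx / rho, uy = jy / rho;
        for (int k = 0; k < Q; ++k)
            f[k] += omega * (equilibrium(k, rho, ux, uy) - f[k]);
    }
};

// Applies `op` to every cell. The buffer uses a structure-of-arrays layout:
// population k of cell i is stored at f[k * ncells + i].
template <class Policy, class Collision>
void collide(Policy&& policy, std::vector<double>& f, std::size_t ncells,
             Collision op) {
    assert(f.size() == static_cast<std::size_t>(Q) * ncells);
    std::vector<std::size_t> cells(ncells);
    std::iota(cells.begin(), cells.end(), std::size_t{0});
    double* fp = f.data();  // raw pointer, captured by value in the kernel
    std::for_each(std::forward<Policy>(policy), cells.begin(), cells.end(),
                  [fp, ncells, op](std::size_t i) {
                      Cell local;
                      for (int k = 0; k < Q; ++k) local[k] = fp[k * ncells + i];
                      op(local);  // resolved at compile time
                      for (int k = 0; k < Q; ++k) fp[k * ncells + i] = local[k];
                  });
}
```

A call such as `collide(std::execution::par_unseq, f, ncells, BGK{1.2});` runs the same source on multicore CPUs or, with a compiler able to offload standard parallel algorithms (for example NVIDIA's `nvc++` with `-stdpar=gpu`), on a GPU. Whether this sketch compiles unchanged for a given GPU toolchain has not been checked here. Swapping in another operator only requires a different struct with the same call signature, which is the kind of extensibility the abstract attributes to the template-based design.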
