We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as porous media flows. The data structure is integrated into a code generation pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils and collision operators and to generate efficient kernel code for CPUs as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to reduce memory accesses and memory consumption, and we implement a communication-hiding technique to demonstrate scalability. We present single-GPU performance results with up to 99% of the maximum bandwidth utilization. We integrate the optimized generated kernels into the high-performance framework WALBERLA and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three different applications to test the sparse data structure on realistic demonstrator problems. We show performance results for flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. For these applications, the sparse / indirect-addressing data structure achieves a maximum performance speed-up of 2 and reduces memory consumption by up to 75% compared to the direct-addressing data structure.
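To illustrate the idea of indirect addressing in a sparse LBM, the following is a minimal sketch, not the paper's implementation: it assumes a D2Q9 lattice, a structure-of-arrays PDF layout over fluid cells only, a precomputed pull-index array, and a plain two-grid pull-stream step (the in-place streaming pattern mentioned above would avoid the second array). All names (`SparseLattice`, `pullIdx`, `streamPull`) are hypothetical.

```cpp
// Minimal sketch of an indirect-addressing (sparse) LBM data layout.
// Assumptions (not from the paper): D2Q9 lattice, structure-of-arrays PDF
// storage over fluid cells only, and a simple two-grid pull-stream step.
#include <cstddef>
#include <vector>

constexpr int Q = 9;  // number of lattice directions (D2Q9 assumed)

struct SparseLattice {
    std::size_t numFluidCells;          // only fluid cells are stored
    std::vector<double> pdfSrc;         // size Q * numFluidCells
    std::vector<double> pdfDst;         // size Q * numFluidCells
    // pullIdx[q * numFluidCells + i] is the flat index into pdfSrc from
    // which fluid cell i pulls its population in direction q; links that
    // would cross into solid cells can point back to the cell itself to
    // emulate bounce-back (assumption for this sketch).
    std::vector<std::size_t> pullIdx;
};

// Pull streaming over the fluid-cell list: every access is resolved through
// the precomputed index array, so solid cells consume no storage or work.
void streamPull(SparseLattice& lat) {
    const std::size_t n = lat.numFluidCells;
    for (int q = 0; q < Q; ++q) {
        for (std::size_t i = 0; i < n; ++i) {
            lat.pdfDst[q * n + i] = lat.pdfSrc[lat.pullIdx[q * n + i]];
        }
    }
    lat.pdfSrc.swap(lat.pdfDst);  // two-grid update; the in-place pattern
                                  // described in the paper removes this copy
}
```

The memory-saving argument follows directly from this layout: for a domain where, say, only 25% of the nodes are fluid, both the PDF arrays and the kernel loop shrink to that fraction, at the cost of one extra index load per population.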