Ensemble data assimilation of computational fluid dynamics simulations, based on the lattice Boltzmann method (LBM) and the local ensemble transform Kalman filter (LETKF), is implemented and optimized on a GPU supercomputer equipped with NVIDIA A100 GPUs. To connect the LBM and LETKF parts, the data transpose communication is optimized by overlapping computation, file I/O, and communication according to the data dependencies of each LETKF kernel. In two-dimensional forced isotropic turbulence simulations with an ensemble size of $M=64$ and $N_x=128^2$ grid points, the optimized implementation achieved a $3.80\times$ speedup over the naive implementation, in which the LETKF part is not parallelized. The main computing kernel of the local problem is the eigenvalue decomposition (EVD) of $M\times M$ real symmetric dense matrices, which is computed by a newly developed batched EVD in $\verb|EigenG|$. The batched EVD in $\verb|EigenG|$ outperforms that in $\verb|cuSOLVER|$, and a $65.3\times$ speedup was achieved.
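To make the role of the batched EVD concrete, the following is a minimal CPU sketch of how the LETKF weights for many independent local problems can be obtained from eigendecompositions of $M\times M$ real symmetric matrices. It follows the standard LETKF formulation and uses NumPy's stacked `numpy.linalg.eigh`, not $\verb|EigenG|$ or $\verb|cuSOLVER|$. The function name `letkf_weights`, its argument layout, the diagonal observation-error covariance, and the inflation parameter `rho` are assumptions made for illustration and are not taken from the implementation described above.

```python
import numpy as np

def letkf_weights(Yb, d, r_inv, rho=1.0):
    """Batched LETKF weights from a symmetric eigendecomposition (illustrative).

    Yb    : (G, p, M) background perturbations in observation space,
            one (p, M) block per local problem
    d     : (G, p)    innovations, observation minus background mean
    r_inv : (G, p)    inverse observation-error variances (diagonal R assumed)
    rho   : multiplicative covariance inflation (hypothetical default of 1)
    """
    G, p, M = Yb.shape
    YbT = np.swapaxes(Yb, -1, -2)                  # (G, M, p)
    C = YbT * r_inv[:, None, :]                    # Yb^T R^{-1}
    A = (M - 1) / rho * np.eye(M) + C @ Yb         # (G, M, M), symmetric positive definite

    # One call performs G independent EVDs of M x M symmetric matrices.
    lam, Q = np.linalg.eigh(A)
    QT = np.swapaxes(Q, -1, -2)

    Pa = (Q / lam[:, None, :]) @ QT                # A^{-1} = Q diag(1/lam) Q^T
    Wa = np.sqrt(M - 1) * (Q / np.sqrt(lam)[:, None, :]) @ QT  # [(M-1) A^{-1}]^{1/2}
    w_mean = (Pa @ (C @ d[..., None]))[..., 0]     # (G, M) weights for the analysis mean
    return w_mean, Wa
```

In this sketch the analysis ensemble of each local problem would be formed from the background perturbations multiplied by `Wa`, with `w_mean` added to every column. The eigendecomposition is the only step that is not a plain matrix product, which is consistent with the EVD being the main computing kernel of the local problem.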