High-speed chemically active flows pose significant computational challenges because of their disparate spatial and temporal scales, with stiff chemistry often dominating simulation time. Although modern scientific codes reach exascale performance by leveraging graphics processing units (GPUs), existing GPU-based compressible combustion solvers face critical limitations in memory management, load balancing, and handling the highly localized nature of chemical reactions. To address these limitations, we present a high-performance compressible reacting flow solver built on the AMReX framework and optimized for multi-GPU settings. Our approach targets three GPU performance bottlenecks: memory access patterns, through column-major storage optimization; computational workload variability, via a bulk-sparse integration strategy for chemical kinetics; and multi-GPU load distribution for adaptive mesh refinement applications. The solver adapts existing matrix-based chemical kinetics formulations to multigrid contexts. Using representative combustion applications, including hydrogen-air detonations and jet-in-supersonic-crossflow configurations, we demonstrate $2$--$5\times$ performance improvements over initial GPU implementations, with near-ideal weak scaling from $1$ to $96$ NVIDIA H100 GPUs. Roofline analysis reveals substantial improvements in arithmetic intensity for both the convection ($\sim 10\times$) and chemistry ($\sim 4\times$) routines, confirming efficient utilization of GPU memory bandwidth and computational resources.
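To make the roofline statement concrete, the following is a minimal sketch of the standard roofline bound, in which attainable throughput is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The function names and the device numbers (`PEAK_GFLOPS`, `BANDWIDTH_GBS`) are hypothetical placeholders chosen for illustration; they are not H100 specifications and not values measured by the solver.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound in GFLOP/s.

    arithmetic_intensity: FLOPs performed per byte moved to or from memory.
    The kernel is limited by whichever roof is lower: the compute peak or
    the memory roof (arithmetic intensity times bandwidth).
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical device numbers, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# A hypothetical memory-bound kernel before and after a 10x gain in
# arithmetic intensity (both values lie at or below the ridge point).
before = attainable_gflops(0.5, PEAK_GFLOPS, BANDWIDTH_GBS)
after = attainable_gflops(5.0, PEAK_GFLOPS, BANDWIDTH_GBS)

print(before, after, ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS))
```

Under this model, a gain in arithmetic intensity raises the attainable throughput proportionally only while the kernel stays below the ridge point; beyond it, the compute peak becomes the limit.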