Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face growing challenges on heterogeneous HPC systems: excessive data movement, synchronization overheads, and inefficient use of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that runs scalably on both Nvidia and AMD accelerators. It uses OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory that improves the efficiency of large data transfers, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized, scalable I/O is provided through openPMD and ADIOS2, which support high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) on up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.
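To make the task-based overlap concrete, the following is a minimal sketch in C and not BIT1 source code. It assumes a contiguous one-dimensional particle array split into equal chunks; the function name `push_and_reduce`, the simple `x += v * dt` update, and the per-chunk checksum standing in for communication are all hypothetical. Each chunk is pushed by a deferred `target` task, and a host task that depends on that chunk can start as soon as its push completes, while other chunks are still being processed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from BIT1: advance nchunk contiguous chunks
 * of n particles each (x += v * dt) and write one checksum per chunk to sum.
 * The checksum is a stand-in for a communication or I/O step. */
void push_and_reduce(double *x, const double *v, double *sum,
                     int nchunk, int n, double dt)
{
    assert(nchunk > 0 && n > 0);

    for (int c = 0; c < nchunk; ++c) {
        /* Chunk c of the contiguous one-dimensional arrays. */
        double *xc = x + (size_t)c * (size_t)n;
        const double *vc = v + (size_t)c * (size_t)n;
        double *sc = sum + c;

        /* Deferred device task: push chunk c. With no device available,
         * the region falls back to host execution. */
        #pragma omp target teams distribute parallel for nowait \
            map(tofrom: xc[0:n]) map(to: vc[0:n]) depend(out: xc[0:n])
        for (int i = 0; i < n; ++i)
            xc[i] += vc[i] * dt;

        /* Host task: runs only after the push of chunk c has finished,
         * but may overlap with the pushes of other chunks. */
        #pragma omp task depend(in: xc[0:n]) firstprivate(xc, sc, n)
        {
            double s = 0.0;
            for (int i = 0; i < n; ++i)
                s += xc[i];
            *sc = s;
        }
    }

    /* Wait for all device and host tasks generated above. */
    #pragma omp taskwait
}
```

Built without OpenMP support, the pragmas are ignored and the function runs serially with the same result. The sketch maps data on every call for brevity; a persistent device-resident layout as described in the abstract would instead keep the arrays on the device across time steps.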