Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face growing challenges on heterogeneous HPC systems: excessive data movement, synchronization overheads, and inefficient use of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that scales on both NVIDIA and AMD accelerators by using OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory that improves the efficiency of large data transfers, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized, scalable I/O is provided through openPMD and ADIOS2, which support high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) on up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.
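The overlap of computation and communication through OpenMP target tasks with explicit dependencies, as described above, can be illustrated with a minimal sketch. This is not BIT1 code: the function name `push_chunks`, the chunking scheme, and the round-robin device assignment are hypothetical choices made for illustration, and the host task that sums each chunk merely stands in for an MPI exchange. Each chunk of a contiguous one-dimensional array is updated by a deferred `target ... nowait` task, and a host task that depends on the same chunk starts as soon as that chunk is ready, while other chunks may still be computing.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of offload devices; 0 when built without OpenMP support. */
static int device_count(void)
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical helper: advance positions x by v*dt, chunk by chunk.
 * chunk_sum[c] receives the sum of the updated positions of chunk c;
 * this host-side step is a stand-in for handing the chunk to MPI. */
void push_chunks(double *x, const double *v, size_t n, double dt,
                 int nchunks, double *chunk_sum)
{
    int ndev = device_count();
    size_t len = n / (size_t)nchunks;

    #pragma omp parallel
    #pragma omp single
    {
        for (int c = 0; c < nchunks; ++c) {
            size_t lo = (size_t)c * len;
            /* The last chunk absorbs any remainder. */
            size_t cnt = (c == nchunks - 1) ? n - lo : len;
            /* Assumed policy: round-robin over available devices. */
            int dev = ndev > 0 ? c % ndev : 0;

            /* Deferred target task: runs on the device (or on the host
             * if no device exists) and marks the chunk as produced. */
            #pragma omp target teams distribute parallel for nowait \
                device(dev) if(target: ndev > 0) \
                map(tofrom: x[lo:cnt]) map(to: v[lo:cnt]) \
                depend(out: x[lo])
            for (size_t i = lo; i < lo + cnt; ++i)
                x[i] += v[i] * dt;

            /* Host task: waits only for its own chunk, so it can run
             * while target tasks for other chunks are still active. */
            #pragma omp task depend(in: x[lo]) firstprivate(lo, cnt, c)
            {
                double s = 0.0;
                for (size_t i = lo; i < lo + cnt; ++i)
                    s += x[i];
                chunk_sum[c] = s;
            }
        }
        #pragma omp taskwait
    }
}
```

Calling `push_chunks(x, v, n, dt, nchunks, sums)` on host arrays updates `x` in place and fills `sums` with one value per chunk. Built without OpenMP, the pragmas are ignored and the same code runs serially on the host, which makes the dependency structure easy to check before enabling offload.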