Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face growing challenges on heterogeneous HPC systems: excessive data movement, synchronization overheads, and inefficient use of multiple accelerators. In this work, we present a portable, multi-GPU, hybrid MPI+OpenMP implementation of BIT1 that runs scalably on both Nvidia and AMD accelerators, using OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory to improve the efficiency of large data transfers, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized and scalable I/O is provided by openPMD and ADIOS2, which support high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) on up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.