POLAR-PIC: A Holistic Framework for Matrixized PIC with Co-Designed Compute, Layout, and Communication

Particle-in-Cell (PIC) simulations are fundamental to plasma physics, but their scalability is often limited by particle-grid interaction bottlenecks and particle redistribution costs. Specifically, particle-grid interaction kernels do not yet take full advantage of emerging Matrix Processing Units (MPUs), particle motion introduces irregular memory accesses, and bulk-synchronous redistribution further destroys long-term data locality, thereby limiting parallel efficiency. To address these inefficiencies, we present POLAR-PIC, a co-designed framework for large-scale PIC simulations that (i) reformulates Field Interpolation into an MPU-friendly outer-product form, (ii) maintains a physically ordered particle layout to preserve memory contiguity, and (iii) overlaps particle communication with Deposition to hide redistribution overhead. Evaluation on the pilot system of an exascale supercomputer shows that POLAR-PIC accelerates the entire particle-processing phase by up to 10.9x in uniform plasma and 4.4x in real-world laser-ion acceleration scenarios compared to the native WarpX reference pipeline on LX2. Ablation studies show that Interpolation and Deposition achieve speedups of 8.0x and 13.2x, respectively, and that the asynchronous communication design sustains a 99.1% overlap ratio. In cross-platform comparisons, POLAR-PIC achieves 13.2% of theoretical peak performance on the CPU-based LS system, while WarpX reaches 9.6% on NVIDIA A800 GPUs. Notably, POLAR-PIC maintains 67.5% weak scaling efficiency on over 2 million cores under high-migration dynamic workloads, highlighting the importance of holistic co-design for future matrix-centric HPC systems.
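
To make the outer-product idea concrete, the following is a minimal sketch and not the POLAR-PIC kernel itself. It assumes a 2D grid, positions given in grid units, and first-order (linear) shape functions, none of which are stated in the abstract. Under those assumptions, the 2x2 weight stencil of a particle is the outer product of two 1D weight vectors, so both the field gather and the charge scatter reduce to small dense tile operations of the kind a matrix unit could batch. The helper names `linear_weights`, `outer`, `interpolate`, and `deposit` are hypothetical.

```python
import math

def linear_weights(x):
    """1D linear shape function: left node index and the two node weights.

    Assumes x is a position in grid units (an assumption of this sketch).
    """
    i0 = math.floor(x)
    f = x - i0
    return i0, [1.0 - f, f]

def outer(u, v):
    """Outer product of two small vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def interpolate(field, x, y):
    """Gather a node-centered field to a particle: sum of W * field tile."""
    i0, wx = linear_weights(x)
    j0, wy = linear_weights(y)
    w = outer(wx, wy)  # 2x2 weight tile = wx (outer) wy
    return sum(w[a][b] * field[i0 + a][j0 + b]
               for a in range(2) for b in range(2))

def deposit(rho, x, y, q):
    """Scatter charge q to the grid using the same 2x2 weight tile."""
    i0, wx = linear_weights(x)
    j0, wy = linear_weights(y)
    w = outer(wx, wy)
    for a in range(2):
        for b in range(2):
            rho[i0 + a][j0 + b] += q * w[a][b]

# Usage: a linear field is reproduced exactly by the linear gather.
field = [[2.0 * i + 3.0 * j for j in range(4)] for i in range(4)]
value = interpolate(field, 1.25, 2.5)

rho = [[0.0] * 4 for _ in range(4)]
deposit(rho, 1.25, 2.5, 2.0)
total_charge = sum(map(sum, rho))
```

Because the weights in each dimension sum to one, the scatter conserves the deposited charge, and the gather reproduces any field that is linear in the grid indices. A real implementation would process many particles per tile and use higher-order shapes as required by the simulation.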
