Matrix-PIC: Harnessing Matrix Outer-product for High-Performance Particle-in-Cell Simulations

Particle-in-Cell (PIC) simulations spend most of their execution time on particle--grid interactions, where fine-grained atomic updates become a major bottleneck on traditional many-core CPUs. Recent CPU architectures integrate specialized Matrix Processing Units (MPUs) that efficiently support matrix outer-product operations, offering new opportunities to overcome this limitation. Leveraging this architectural shift, this work focuses on redesigning the current deposition step of PIC simulations under a matrix-centric execution model. We present MatrixPIC, the first holistic co-design of the deposition kernel, data layout, and incremental particle sorting tailored to the hybrid MPU--VPU SIMD model on modern CPUs. MatrixPIC introduces: (i)~a block-matrix formulation of the current deposition algorithm that maps naturally to MPU outer-product primitives; (ii)~a hybrid execution pipeline that combines MPU-based high-density accumulation with VPU-based data preparation and control flow; and (iii)~an $O(1)$-amortized incremental sorter based on a gapped packed-memory array to preserve data locality for efficient MPU execution. Evaluated on a next-generation HPC platform, MatrixPIC achieves significant performance gains. In Laser-Wakefield Acceleration (LWFA) simulations, it delivers up to $2.63\times$ speedup in total runtime. For third-order deposition, the core kernel is accelerated by $8.7\times$ over the baseline and $2.0\times$ over the best hand-optimized VPU implementation. Moreover, MatrixPIC reaches $83.08\%$ of theoretical CPU peak performance, nearly $2.8\times$ higher than a highly optimized CUDA kernel on a data center GPU. These results demonstrate the effectiveness of matrix-oriented co-design for accelerating PIC simulations on emerging CPU architectures.
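
To illustrate why deposition maps onto outer-product primitives, the following is a minimal sketch and not the MatrixPIC kernel itself. It assumes a 2D grid, separable third-order (cubic B-spline) shape functions, and a batch of particles that all share the same 4x4 stencil. The names `cubic_weights`, `deposit_scatter`, and `deposit_matrix` are hypothetical, and the per-particle weight `q` stands in for whatever charge-times-velocity factor a real current component would carry. Under these assumptions, each particle contributes a rank-1 update `q * outer(sx, sy)`, so a batch of particles accumulates as the matrix product `A^T B`.

```python
# Sketch only: pure-Python stand-in for an MPU outer-product accumulation.
# All function names are hypothetical and not taken from MatrixPIC.

def cubic_weights(d):
    """Cubic B-spline (third-order) weights for a fractional offset d in [0, 1)."""
    return [
        (1.0 - d) ** 3 / 6.0,
        (3.0 * d**3 - 6.0 * d**2 + 4.0) / 6.0,
        (-3.0 * d**3 + 3.0 * d**2 + 3.0 * d + 1.0) / 6.0,
        d**3 / 6.0,
    ]

def deposit_scatter(particles):
    """Baseline view: every particle updates its 4x4 stencil node by node."""
    tile = [[0.0] * 4 for _ in range(4)]
    for dx, dy, q in particles:
        sx, sy = cubic_weights(dx), cubic_weights(dy)
        for i in range(4):
            for j in range(4):
                tile[i][j] += q * sx[i] * sy[j]
    return tile

def deposit_matrix(particles):
    """Matrix view: tile = A^T B, a sum of one outer product per particle."""
    # Row p of A holds q_p * sx(p); row p of B holds sy(p).
    A = [[q * w for w in cubic_weights(dx)] for dx, _, q in particles]
    B = [cubic_weights(dy) for _, dy, _ in particles]
    n = len(particles)
    return [
        [sum(A[p][i] * B[p][j] for p in range(n)) for j in range(4)]
        for i in range(4)
    ]

if __name__ == "__main__":
    # (dx, dy, q): fractional offsets inside the shared cell and a weight.
    batch = [(0.25, 0.5, 1.0), (0.75, 0.1, -2.0), (0.0, 0.9, 0.5)]
    ref = deposit_scatter(batch)
    mat = deposit_matrix(batch)
    worst = max(abs(ref[i][j] - mat[i][j]) for i in range(4) for j in range(4))
    print("max abs difference:", worst)
```

The two routines agree up to floating-point rounding, but only the second exposes the work as dense matrix arithmetic. The identity holds only when the batched particles share a stencil, which is one plausible reading of why the abstract pairs the kernel with an incremental sorter that preserves data locality.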
