Matrix-multiply-accumulate (MMA) units, or tensor cores, are now widespread across modern computing architectures, yet their use for particle-grid operators remains limited. In implicit particle methods, mass-matrix assembly is a reduction-dominated kernel in which weighted outer products of interpolation weights are accumulated over each particle's support. We show that this operation can be reformulated exactly, cell by cell, as a sequence of matrix products matched to hardware MMA tiles. The formulation is general with respect to interpolation order and hardware platform, and applies to both scalar mass matrices and the tensorial block mass matrix arising in the Energy-Conserving Semi-Implicit Method (ECSIM) for Particle-in-Cell simulations. We introduce particle batching and, for higher-order shape functions whose stencil extends beyond a single cell, a support-group decomposition. We specialize the method to first- and second-order B-spline interpolation and implement it on NVIDIA tensor cores. The resulting kernels achieve up to a 3x speedup over optimized conventional implementations and reduce end-to-end ECSIM runtime by 15%.
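The reformulation described above can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a single 1D cell with two nodes, first-order (linear) B-spline weights, and a scalar per-particle coefficient q. The function names, the batch size of 4, and the zero-padding of the last batch are all hypothetical choices made for illustration. The sketch compares a per-particle outer-product accumulation against the same sum written as batched products A^T B, where each row of A holds one particle's weights and each row of B holds those weights scaled by q.

```python
def linear_weights(x):
    """First-order B-spline weights for local coordinate x in [0, 1)."""
    return [1.0 - x, x]


def mass_matrix_outer(xs, qs):
    """Reference: accumulate one weighted outer product per particle."""
    M = [[0.0, 0.0], [0.0, 0.0]]
    for x, q in zip(xs, qs):
        w = linear_weights(x)
        for i in range(2):
            for j in range(2):
                M[i][j] += q * w[i] * w[j]
    return M


def matmul_t(A, B):
    """Return A^T B for row-major lists A (k x m) and B (k x n)."""
    k, m, n = len(A), len(A[0]), len(B[0])
    return [[sum(A[p][i] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def mass_matrix_batched(xs, qs, batch=4):
    """Same sum, written as one small matrix product per particle batch.

    The batch size stands in for the reduction dimension of an MMA tile
    (an assumption of this sketch). Short batches are zero-padded, which
    leaves the accumulated result unchanged.
    """
    M = [[0.0, 0.0], [0.0, 0.0]]
    for start in range(0, len(xs), batch):
        bx = xs[start:start + batch]
        bq = qs[start:start + batch]
        pad = batch - len(bx)
        # A: one row of interpolation weights per particle.
        A = [linear_weights(x) for x in bx] + [[0.0, 0.0]] * pad
        # B: the same rows scaled by the per-particle coefficient q.
        B = [[q * w for w in linear_weights(x)] for x, q in zip(bx, bq)]
        B += [[0.0, 0.0]] * pad
        C = matmul_t(A, B)
        for i in range(2):
            for j in range(2):
                M[i][j] += C[i][j]
    return M
```

Both functions compute the same sum, so they agree up to floating-point rounding from the different summation order. On hardware, the small product inside the batch loop is the part that would map to an MMA instruction.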