MFEM is a widely used finite-element library, but its native linear-elasticity Partial Assembly (PA) path still applies an $O((p+1)^6)$ contraction in the element operator, which leaves the CPU operator-throughput sweet spot near $p\approx 2$ in our baseline measurements. This work closes that implementation gap for MFEM linear elasticity on affine tensor-product hexahedral meshes by integrating four well-established tensor-product PA optimizations (sum factorization, Voigt notation, macro-kernel fusion, and slice-wise loop reorganization) into MFEM's native linear-elasticity PA path. The resulting operator is evaluated in high-order preconditioned conjugate gradient solves that use MFEM's geometric multigrid (GMG) components as the preconditioner (GMG-PCG). On an AMD EPYC 7713, the optimized operator achieves $7\text{--}83\times$ kernel speedup and $3.6\text{--}16.8\times$ end-to-end speedup across $p\in\{1,2,4,8\}$. At fixed problem size, the kernel-time operator throughput peaks around $p=6$ and remains high at $p=8$, shifting the operator-throughput sweet spot to $p\ge 6$. The same trend is reproduced on a Huawei~Kunpeng~920 (ARMv8.2). These results are accompanied by per-stage ablation and hardware-counter characterization; the implementation will be released on GitHub.
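To make the complexity gap concrete, the following is a minimal sketch of sum factorization for a single scalar interpolation step on one hexahedral element. It is not MFEM's implementation and does not cover the elasticity operator itself; the names `interp_naive`, `interp_sumfact`, and `max_abs_diff`, the row-major layouts, and the dense 1D basis table `B` are assumptions made for illustration. With $n=p+1$ nodes and $q$ quadrature points per direction, the naive form costs $O(q^3 n^3)$, i.e. $O((p+1)^6)$ when $q\sim p+1$, while the factorized form costs $O(n^3 q + n^2 q^2 + n q^3)$, i.e. $O((p+1)^4)$.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Assumed layouts (illustrative only):
//   B[a*n + i]       : 1D basis function i evaluated at 1D quadrature point a
//   u[(i*n + j)*n+k] : nodal values on an n x n x n tensor-product element
//   v[(a*q + b)*q+c] : values at the q x q x q quadrature points

// Naive evaluation: every output sums over all n^3 inputs -> O(q^3 n^3).
Vec interp_naive(const Vec& B, const Vec& u, std::size_t n, std::size_t q) {
    Vec v(q * q * q, 0.0);
    for (std::size_t a = 0; a < q; ++a)
        for (std::size_t b = 0; b < q; ++b)
            for (std::size_t c = 0; c < q; ++c) {
                double s = 0.0;
                for (std::size_t i = 0; i < n; ++i)
                    for (std::size_t j = 0; j < n; ++j)
                        for (std::size_t k = 0; k < n; ++k)
                            s += B[a * n + i] * B[b * n + j] * B[c * n + k]
                                 * u[(i * n + j) * n + k];
                v[(a * q + b) * q + c] = s;
            }
    return v;
}

// Sum factorization: contract one direction at a time through two temporaries.
Vec interp_sumfact(const Vec& B, const Vec& u, std::size_t n, std::size_t q) {
    // t1[i,j,c] = sum_k B[c,k] * u[i,j,k]   (n^3 q operations)
    Vec t1(n * n * q, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t c = 0; c < q; ++c)
                for (std::size_t k = 0; k < n; ++k)
                    t1[(i * n + j) * q + c] += B[c * n + k] * u[(i * n + j) * n + k];

    // t2[i,b,c] = sum_j B[b,j] * t1[i,j,c]  (n^2 q^2 operations)
    Vec t2(n * q * q, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t b = 0; b < q; ++b)
            for (std::size_t c = 0; c < q; ++c)
                for (std::size_t j = 0; j < n; ++j)
                    t2[(i * q + b) * q + c] += B[b * n + j] * t1[(i * n + j) * q + c];

    // v[a,b,c] = sum_i B[a,i] * t2[i,b,c]   (n q^3 operations)
    Vec v(q * q * q, 0.0);
    for (std::size_t a = 0; a < q; ++a)
        for (std::size_t b = 0; b < q; ++b)
            for (std::size_t c = 0; c < q; ++c)
                for (std::size_t i = 0; i < n; ++i)
                    v[(a * q + b) * q + c] += B[a * n + i] * t2[(i * q + b) * q + c];
    return v;
}

// Largest absolute entrywise difference between two equally sized vectors.
double max_abs_diff(const Vec& x, const Vec& y) {
    double m = 0.0;
    for (std::size_t r = 0; r < x.size(); ++r)
        m = std::max(m, std::fabs(x[r] - y[r]));
    return m;
}
```

Both routines compute the same tensor contraction, so they should agree up to floating-point rounding; only the operation count differs. The gradient and quadrature-weighting steps of an elasticity operator follow the same pattern with additional 1D derivative tables, which this sketch omits.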