Accurate performance prediction is essential for optimizing scientific applications on modern high-performance computing (HPC) architectures. Widely used performance models focus primarily on cache and memory bandwidth, which suits many memory-bound workloads. Such models are unsuitable, however, for highly arithmetic-intensive cases such as sum-factorization with tensor $n$-mode product kernels, an optimization technique for high-order finite element methods (FEM). On processors with relatively high single-instruction-multiple-data (SIMD) instruction latency, such as the Fujitsu A64FX, the performance of these kernels is strongly influenced by loop-body splitting strategies. Memory-bandwidth-oriented models are therefore inappropriate for evaluating these splitting configurations; a model that directly reflects instruction-level efficiency is required. To address this need, we develop a dependency-chain-based analytical formulation that links loop-splitting configurations to instruction dependencies in the tensor $n$-mode product kernel. We further use XGBoost to estimate key parameters of the analytical model that are difficult to model explicitly. Evaluations show that the learning-augmented model outperforms the widely used standard Roofline and Execution-Cache-Memory (ECM) models. On the Fujitsu A64FX processor, the learning-augmented model achieves mean absolute percentage errors (MAPE) between 1% and 24% for polynomial orders ($P$) from 1 to 15, whereas the standard Roofline and ECM models yield errors of 42%-256% and 5%-117%, respectively. On the Intel Xeon Gold 6230 processor, the learning-augmented model achieves MAPE values from 1% to 13% for $P$=1 to $P$=14, and 24% at $P$=15, whereas the standard Roofline and ECM models produce errors of 1%-73% and 8%-112%, respectively, over the same range.
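To make the kernel under study concrete, the following is a minimal, unoptimized sketch of a tensor 1-mode product (the building block of sum-factorization) for a third-order tensor stored as nested lists. The function name and loop structure are illustrative assumptions, not the paper's optimized kernel; the innermost reduction is the serial dependency chain whose length, on high-latency SIMD units, loop-body splitting reorganizes.

```python
# Minimal sketch of a tensor 1-mode product Y = X x_1 M, where X has
# shape I x J x K and M has shape R x I. Illustrative only; not the
# paper's optimized sum-factorization kernel.

def mode1_product(X, M):
    """Contract the first mode of X with the rows of M."""
    I, J, K, R = len(X), len(X[0]), len(X[0][0]), len(M)
    Y = [[[0.0] * K for _ in range(J)] for _ in range(R)]
    for r in range(R):
        for j in range(J):
            for k in range(K):
                # Serial reduction over mode 1: each add depends on the
                # previous one, forming the dependency chain that
                # loop-body splitting breaks up on high-latency SIMD units.
                acc = 0.0
                for i in range(I):
                    acc += M[r][i] * X[i][j][k]
                Y[r][j][k] = acc
    return Y
```

Contracting with the identity matrix returns the tensor unchanged, which provides a quick sanity check of the index convention.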