Analytic Roofline Modeling and Energy Analysis of the LULESH Proxy Application on Multi-Core Clusters

We present a thorough performance and energy analysis of the LULESH proxy application in its OpenMP and MPI variants on two clusters based on Intel Ice Lake (ICL) and Sapphire Rapids (SPR) CPUs. We first study the node-level strong scaling and power consumption characteristics of the six hot spot functions in the code, with a special focus on memory bandwidth utilization. We then construct a detailed Roofline performance model for each memory-bound hot spot, validate it against hardware performance counter measurements, and comment on the discrepancies between the analytic model and the observations. To separate the influence of the programming model from that of the code's implementation, we compare OpenMP and MPI performance as a function of problem size, examining whether the underlying implementations are equivalent for large problems and whether differences in overhead become more significant at smaller problem sizes. We also analyze the power dissipation, energy to solution, and energy-delay product (EDP) of the hot spots, quantifying the influence of problem size, core and uncore clock frequency, and number of active cores per ccNUMA domain. Relevant energy savings are possible only for memory-bound functions, by using fewer cores per ccNUMA domain and/or reducing the core clock speed. A major issue is the very high extrapolated baseline power on both chips, which makes concurrency throttling less effective. In terms of EDP, SPR offers lower values than ICL only for memory-bound workloads.
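The interplay between bandwidth saturation, baseline power, and concurrency throttling described above can be illustrated with a minimal sketch. It assumes a textbook Roofline runtime bound, linear bandwidth scaling up to a saturation limit within one ccNUMA domain, and a simple linear power model (baseline power plus a fixed increment per active core). These modeling choices, the function names, and all numbers are illustrative assumptions. They are not the model or the measurements of this work.

```python
def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline runtime bound: the slower of in-core execution and data transfer."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


def domain_bandwidth(n_cores, single_core_bw, saturated_bw):
    """Assumed scaling: bandwidth grows linearly with cores until it saturates."""
    return min(n_cores * single_core_bw, saturated_bw)


def energy_and_edp(runtime, n_cores, baseline_power, power_per_core):
    """Assumed linear power model: P = baseline + n_cores * per-core increment."""
    power = baseline_power + n_cores * power_per_core
    energy = power * runtime          # energy to solution in joules
    return energy, energy * runtime   # (energy, energy-delay product)


def sweep(max_cores, flops, bytes_moved, core_peak, single_core_bw,
          saturated_bw, baseline_power, power_per_core):
    """Return {n_cores: (runtime, energy, edp)} for 1..max_cores active cores."""
    results = {}
    for n in range(1, max_cores + 1):
        bw = domain_bandwidth(n, single_core_bw, saturated_bw)
        t = roofline_time(flops, bytes_moved, n * core_peak, bw)
        e, edp = energy_and_edp(t, n, baseline_power, power_per_core)
        results[n] = (t, e, edp)
    return results


# Hypothetical memory-bound kernel on a hypothetical 18-core ccNUMA domain.
HYPOTHETICAL = dict(
    max_cores=18,
    flops=20e9,            # total floating-point operations
    bytes_moved=160e9,     # total memory traffic in bytes
    core_peak=40e9,        # per-core peak in flop/s
    single_core_bw=12e9,   # bandwidth drawn by one core in byte/s
    saturated_bw=80e9,     # saturated domain bandwidth in byte/s
    baseline_power=100.0,  # extrapolated baseline power in W
    power_per_core=5.0,    # power increment per active core in W
)

results = sweep(**HYPOTHETICAL)
best_cores = min(results, key=lambda n: results[n][1])  # minimum-energy core count
```

In this sketch the energy minimum sits at the core count where the bandwidth saturates: adding cores beyond that point raises power without reducing runtime, whereas using fewer cores lengthens the runtime while the baseline power keeps being paid. A larger `baseline_power` shrinks the relative saving from running at the saturation point instead of on all cores, which is one way to read the statement that a high baseline power makes concurrency throttling less effective.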
