In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. We identify conditions under which SpecI2M works as intended and where it fails to avoid write-allocate transfers. Write-allocate evasion works best if large arrays are written consecutively; in the CloverLeaf code, non-temporal stores can be employed on top for best results. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.
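The link between a prime process count and short inner loops can be illustrated with a minimal sketch. It assumes a decomposition heuristic that picks the most square-like two-factor split of the process count and places the larger factor along the inner (contiguous) dimension. The function names `decompose` and `inner_length`, the grid size 15360, and the process counts 71 and 72 are hypothetical choices for illustration, not taken from the CloverLeaf source or from the paper's measurements.

```python
import math


def decompose(nprocs: int) -> tuple[int, int]:
    """Return (px, py) with px * py == nprocs, as close to square as possible.

    Assumption: px (the larger factor) splits the inner, contiguous dimension.
    For a prime nprocs the only split is (nprocs, 1), i.e. one-dimensional.
    """
    best = (nprocs, 1)
    for py in range(1, math.isqrt(nprocs) + 1):
        if nprocs % py == 0:
            # Larger py means a more square-like process grid.
            best = (nprocs // py, py)
    return best


def inner_length(nx: int, nprocs: int) -> int:
    """Approximate inner-loop length per process (remainder cells ignored)."""
    px, _ = decompose(nprocs)
    return nx // px


if __name__ == "__main__":
    nx = 15360  # hypothetical global grid extent in the inner dimension
    for p in (71, 72):
        print(p, decompose(p), inner_length(nx, p))
```

Under these assumptions, 72 processes yield a 9 x 8 process grid with about 1706 cells per inner loop, whereas 71 processes yield a 71 x 1 grid with only about 216 cells. That is the kind of short consecutive write stream for which the abstract reports write-allocate evasion to be less effective.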