Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale

Panagiotis-Eleftherios Eleftherakis, George Anagnostopoulos, Anastassis Kapetanakis, Mohammad Umair, Jean-Yves Vet, Konstantinos Iliakis, Jonathan Vincent, Jing Gong, Akshay Patil, Clara García-Sánchez, Gerardo Zampino, Ricardo Vinuesa, Sotirios Xydis

From arXiv; DATE 26 conference, Multi-Partner Project Paper

As heterogeneous supercomputing architectures leveraging GPUs become increasingly central to high-performance computing (HPC), it is crucial for computational fluid dynamics (CFD) simulations, a de facto HPC workload, to use such hardware efficiently. One of the key challenges for HPC codes is performance portability, i.e., the ability to maintain near-optimal performance across different accelerators. In the context of the \textbf{REFMAP} project, which targets scalable, GPU-enabled multi-fidelity CFD for urban airflow prediction, this paper analyzes the performance portability of SOD2D, a state-of-the-art Spectral Elements simulation framework, across AMD and NVIDIA GPU architectures. We first discuss the physical and numerical models underlying SOD2D, highlighting its computational hotspots. We then examine its performance and scalability in a multi-level manner, i.e., by defining and characterizing an extensive full-stack design space spanning application, software, and hardware infrastructure parameters. Single-GPU performance characterization across server-grade NVIDIA and AMD GPU architectures and vendor-specific compiler stacks shows both the potential and the diverse effect of memory access optimizations, i.e., speedups varying from 0.69$\times$ to 3.91$\times$. The performance variability of SOD2D at scale is further examined on the LUMI multi-GPU cluster, where profiling reveals similar throughput variations, highlighting the limits of performance projections and the need for multi-level, informed tuning.
