Computational Fluid Dynamics (CFD) places extreme demands on memory and compute resources, so simulations of practical domain sizes require leadership-scale computing platforms. At these scales, traditional checkpointing becomes impractical because writing state data to disk significantly slows the simulation. As High-Performance Computing (HPC) moves toward exascale, GPU-driven systems and ever larger problem sizes, practitioners face an increasingly stark choice between compromising data fidelity and reducing resolution. To address this challenge, this study advocates in situ analysis and visualization techniques, which allow more frequent data "snapshots" to be taken directly from memory and thus avoid disruptive checkpointing. We detail our approach to instrumenting NekRS, a GPU-focused thermal-fluid simulation code based on the spectral element method (SEM), and describe several in situ and in transit strategies for rendering data. We also present concrete scientific use cases and report on runs performed on Polaris, the 44-petaflop supercomputer at the Argonne Leadership Computing Facility (ALCF), and on the Jülich Wizard for European Leadership Science (JUWELS) Booster, the 71-petaflop HPC system at the Jülich Supercomputing Centre (JSC), offering practical insight into the implications of our methodology.