Computational Fluid Dynamics (CFD) simulations place extreme demands on memory and compute resources, so practical domain sizes require leadership-scale computing platforms. At this scale, traditional checkpointing becomes ineffective because saving state data to disk significantly slows the simulation. As we move toward exascale and GPU-driven High-Performance Computing (HPC) and confront larger problem sizes, the choice becomes increasingly stark: compromise data fidelity or reduce resolution. To address this challenge, this study advocates \textit{in situ} analysis and visualization techniques, which allow more frequent data "snapshots" to be taken directly from memory and thus avoid disruptive checkpointing. We detail our approach to instrumenting NekRS, a GPU-focused thermal-fluid simulation code based on the spectral element method (SEM), and describe several \textit{in situ} and in transit strategies for data rendering. We also present concrete scientific use cases and report on runs performed on Polaris, the Argonne Leadership Computing Facility's (ALCF) 44-petaflop supercomputer, and on the J\"ulich Wizard for European Leadership Science (JUWELS) Booster, the J\"ulich Supercomputing Centre's (JSC) 71-petaflop HPC system, offering practical insight into the implications of our methodology.