As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that combines a high-performance C++ API with GPU parallelism to enable high-throughput diagnostics. The C++ API achieves an ingestion time of 9.69 seconds for 100,000 MPI ranks on Aurora. The GPU-accelerated layer achieves up to a 314x speedup over CPU-based processing when analyzing 100,000 execution traces. We also implement a topology-aware workflow that maps logical performance outliers to physical Slingshot interconnect coordinates, localizing network congestion across 22 distinct racks on Aurora. We further demonstrate how the framework's advanced interface integrates with external tools to provide sophisticated analytical models. Finally, we introduce a novel tri-dimensional performance model that "re-materializes" iterative behavior from execution traces; using this model, we identify a 32.28% potential speedup for a GAMESS workload on Frontier.