As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that leverages a high-performance C++ API and GPU parallelism to enable high-throughput diagnostics. Our C++ API ingests data for 100,000 MPI ranks on Aurora in 9.69 seconds, and our GPU-accelerated layer achieves up to 314x speedup over CPU-based processing when analyzing 100,000 execution traces. We also implement a topology-aware workflow that maps logical performance outliers to physical Slingshot interconnect coordinates, localizing network congestion across 22 distinct racks on Aurora. We further demonstrate how the framework's advanced interface integrates with external tools to support sophisticated analytical models. Finally, we introduce a novel tri-dimensional performance model that "re-materializes" iterative behavior from execution traces; using this model, we identified a 32.28% potential speedup for a GAMESS workload on Frontier.