Performance analysis is an essential task in High-Performance Computing (HPC) systems, where it serves purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring generates a large number of Key Performance Indicators (KPIs) that track the status of the jobs running on these systems. KPIs report CPU usage, memory usage, network (interface) traffic, and readings from other sensors that monitor the hardware. Analyzing this data yields insight into running jobs, such as their characteristics, performance, and failures. The main contribution of this paper is to identify which KPIs are most appropriate for classifying different types of jobs according to their behavior in the HPC system. To this end, we applied several clustering techniques (partitional and hierarchical clustering algorithms) to a real dataset from the Galician Computation Center (CESGA). We conclude that (i) the KPIs related to network (interface) traffic monitoring provide the best cohesion and separation for clustering HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task. We validated our approach on a different real dataset from the same HPC center.
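To make the evaluation idea concrete, the following is a minimal sketch of how one might compare two KPI groups by the cohesion and separation of the clusters they produce. It rests on several assumptions that are not taken from the abstract: the silhouette coefficient stands in for the cohesion/separation measure, average linkage stands in for the hierarchical algorithm, and the per-job feature values (`network_kpis`, `cpu_kpis`) are invented purely for illustration and are not CESGA data.

```python
from itertools import combinations
from math import dist
from statistics import mean


def agglomerative(points, k):
    """Average-linkage hierarchical clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Merge the two clusters with the smallest mean pairwise distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: mean(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a] += clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        cohesion = mean(own)  # mean distance to the point's own cluster
        separation = min(     # mean distance to the nearest other cluster
            mean(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((separation - cohesion) / max(cohesion, separation))
    return mean(scores)


# Hypothetical per-job feature vectors (two KPIs per job), invented for illustration.
network_kpis = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (8.0, 8.1), (8.2, 7.9), (7.9, 8.0)]
cpu_kpis = [(0.50, 0.52), (0.55, 0.48), (0.47, 0.53), (0.52, 0.50), (0.49, 0.55), (0.54, 0.47)]

network_score = silhouette(network_kpis, agglomerative(network_kpis, 2))
cpu_score = silhouette(cpu_kpis, agglomerative(cpu_kpis, 2))
print(f"network KPIs: {network_score:.2f}, CPU KPIs: {cpu_score:.2f}")
```

In this toy setup the KPI group with the higher score separates the jobs more cleanly. A real study would normalize the KPIs first and would likely rely on an established library rather than this quadratic-time illustration.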