Understanding how users of HPC facilities behave, and how computational resources are requested and utilized, is crucial for cluster productivity and essential for designing and constructing future exascale HPC systems. This paper tackles Challenge 4, 'Analyzing Resource Utilization and User Behavior on Titan Supercomputer', of the 2021 Smoky Mountains Conference Data Challenge. Specifically, we dig into the records of Titan to discover patterns and extract relationships. We explore the workload distribution and usage patterns found in resource manager system logs, GPU traces, and scientific-area information collected from the Titan supercomputer, and we examine how resource utilization and user behavior change over time. Using data science methods such as correlation analysis, clustering, and neural networks, we investigate how projects, jobs, nodes, GPUs, and memory are related. We provide insights into the seasonal usage of resources and a predictive model for forecasting the utilization of the Titan supercomputer. In addition, the described methodology can be readily adopted for other HPC clusters.