Understanding the cluster-wide I/O patterns of large-scale HPC clusters is essential to minimizing the occurrence and impact of I/O interference. Yet most previous work in this area has focused on monitoring and predicting task- and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five-minute intervals. We observe significant (over 100x) fluctuations in read and write I/O rates on all three clusters. We then train machine learning models to predict the occurrence of system-level I/O bursts 5 to 120 minutes ahead. Evaluation results show that the models predict I/O bursts with an F1 score above 90% five minutes ahead and above 87% two hours ahead. The models also attain more than 70% accuracy when estimating the degree of an I/O burst. We believe that high-accuracy predictions of I/O bursts can be used in multiple ways, such as postponing delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential applications (e.g., file system scrubbers), and devising I/O-aware job scheduling methods. To validate this claim, we simulated a burst-aware job scheduler that can postpone the start time of applications to avoid I/O bursts. We show that burst-aware job scheduling can reduce application runtime by up to 5x.
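As a rough illustration of the first step described in the abstract, the sketch below aggregates per-job I/O records into five-minute bins and flags bins whose rate is far above a baseline. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record layout `(start_s, end_s, bytes)`, the function names `bin_io_rates` and `label_bursts`, the uniform spreading of bytes over a record's duration, and the "10x the median rate" burst threshold are all hypothetical choices made for this example. The abstract does not state how a burst is defined.

```python
from collections import defaultdict
from statistics import median

BIN_SECONDS = 300  # five-minute intervals, as in the abstract


def bin_io_rates(records, bin_seconds=BIN_SECONDS):
    """Aggregate (start_s, end_s, nbytes) records into per-bin byte rates.

    Assumption: each record's bytes are spread uniformly over its duration.
    Returns a dict mapping bin index -> average bytes per second in that bin.
    Bins with no I/O at all are simply absent from the result.
    """
    totals = defaultdict(float)
    for start, end, nbytes in records:
        if end <= start:
            # Zero-length record: attribute all bytes to the starting bin.
            totals[int(start // bin_seconds)] += nbytes
            continue
        per_second = nbytes / (end - start)
        first = int(start // bin_seconds)
        last = int(end // bin_seconds)
        for b in range(first, last + 1):
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            if hi > lo:  # skip bins the record only touches at a boundary
                totals[b] += per_second * (hi - lo)
    return {b: total / bin_seconds for b, total in totals.items()}


def label_bursts(rates, factor=10.0):
    """Mark a bin as a burst when its rate exceeds factor * median rate.

    Assumption: this threshold rule is illustrative only; the median is
    taken over bins that saw any I/O.
    """
    if not rates:
        return {}
    baseline = median(rates.values())
    return {b: rate > factor * baseline for b, rate in rates.items()}


if __name__ == "__main__":
    # Synthetic records: three quiet five-minute windows, then one heavy one.
    demo = [
        (0, 300, 300.0),
        (300, 600, 300.0),
        (600, 900, 300.0),
        (900, 1200, 30000.0),
    ]
    rates = bin_io_rates(demo)
    print(rates)
    print(label_bursts(rates))
```

The resulting per-bin labels could serve as targets for a classifier that looks 5 to 120 minutes ahead. Separate read and write series would be produced by calling `bin_io_rates` once per direction.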