Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers (BBs) can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining whether a given application can be accelerated using burst buffers is not straightforward, even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume and the number of processes involved) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data, which we use to train the model. Our model predicts whether a file of an application should be placed on BBs with an accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.