IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

As the HPC storage stack grows rapidly in complexity, domain scientists find it increasingly difficult to use HPC storage systems effectively and achieve the I/O performance they need. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and point out potential problems. However, I/O experts are few and the demand from data-intensive applications keeps growing, so limited access to this expertise has become a major bottleneck that keeps scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. Key challenges remain, however: LLMs struggle with long contexts, lack accurate domain knowledge about HPC I/O, and hallucinate during complex interactions. In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Like an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface through which scientists can ask targeted follow-up questions. To evaluate IOAgent, we collected a diverse set of labeled job traces and released TraceBench, the first open diagnosis test suite. Extensive evaluations on this test suite show that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools, producing accurate and useful diagnoses. We also show that IOAgent is not tied to a specific LLM: it performs similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems.
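The abstract names a tree-based merger but does not describe how it works. The following is only a minimal sketch of one plausible reading, assuming that per-module diagnoses are combined pairwise, level by level, so that no single merge step has to hold the whole trace in context. The names `tree_merge` and `concat_merge` and the module labels are hypothetical, and `concat_merge` is a stand-in for whatever LLM call would actually combine two partial diagnoses; none of this is taken from IOAgent's implementation.

```python
from typing import Callable, List


def tree_merge(items: List[str], merge: Callable[[str, str], str]) -> str:
    """Reduce a list of partial diagnoses pairwise until one remains.

    Each pass merges neighbours (0,1), (2,3), ... and carries an unpaired
    last item up to the next level unchanged.
    """
    if not items:
        raise ValueError("need at least one partial diagnosis to merge")
    level = list(items)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            # Odd count: the last item has no partner on this level.
            next_level.append(level[-1])
        level = next_level
    return level[0]


def concat_merge(a: str, b: str) -> str:
    # Hypothetical stand-in for an LLM call that reconciles two summaries.
    return f"({a} + {b})"


# Illustrative labels only; a real run would pass per-module diagnosis text.
module_summaries = ["POSIX", "MPI-IO", "STDIO", "LUSTRE"]
merged = tree_merge(module_summaries, concat_merge)
print(merged)
```

Under this assumption, each merge step sees only two inputs, which is one way a design like this could sidestep the long-context limitation the abstract mentions.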
