STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems

From arXiv. Published in the Proceedings of the 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC25).

I/O performance is crucial to efficiency in data-intensive scientific computing, but tuning large-scale storage systems is complex, costly, and notoriously labor-intensive, which puts it out of reach for most domain scientists. To address this problem, we propose STELLAR, an autonomous tuner for high-performance parallel file systems. Our evaluations show that STELLAR almost always selects near-optimal parameter configurations for parallel file systems within the first five attempts, even for previously unseen applications. STELLAR differs fundamentally from traditional autotuning methods, which often require hundreds of thousands of iterations to converge. Powered by large language models (LLMs), STELLAR enables autonomous end-to-end agentic tuning by (1) accurately extracting tunable parameters from software manuals, (2) analyzing I/O trace logs generated by applications, (3) selecting initial tuning strategies, (4) rerunning applications on real systems and collecting I/O performance feedback, (5) adjusting tuning strategies and repeating the tuning cycle, and (6) reflecting on and summarizing tuning experiences into reusable knowledge for future optimizations. STELLAR integrates retrieval-augmented generation (RAG), tool execution, LLM-based reasoning, and a multi-agent design to stabilize reasoning and combat hallucinations. We evaluate the impact of each component on optimization outcomes, providing design insights for similar systems in other optimization domains. STELLAR's architecture and empirical results highlight a promising approach to complex system optimization, especially for problems with large search spaces and high exploration costs, while making I/O tuning more accessible to domain scientists with minimal added resources.
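The six-step cycle in the abstract can be pictured as a small feedback loop. The sketch below is only an illustration under stated assumptions, not STELLAR's actual implementation: `tune`, `stub_propose`, and `stub_run_app` are hypothetical names, the LLM agent is replaced by a trivial deterministic rule, the application run is replaced by a synthetic score, and `stripe_count` is just an illustrative parameter name. Steps 1 and 2 (manual extraction and trace analysis) are assumed to have already produced `params` and `trace_summary`.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
Bounds = Dict[str, Tuple[int, int]]
History = List[Tuple[Config, float]]


def tune(
    params: Bounds,          # step 1 output (assumed): parameter name -> (min, max)
    trace_summary: str,      # step 2 output (assumed): short digest of the I/O trace
    propose: Callable[[Bounds, str, History, List[str]], Config],
    run_app: Callable[[Config], float],  # step 4: returns a score, higher is better
    knowledge: List[str],    # step 6: notes reused by later tuning sessions
    max_attempts: int = 5,
) -> Tuple[Config, float]:
    """Run a bounded propose/measure/adjust loop and record a summary."""
    history: History = []
    for _ in range(max_attempts):
        # Steps 3 and 5: the first call picks an initial strategy,
        # later calls adjust it using the feedback in `history`.
        config = propose(params, trace_summary, history, knowledge)
        score = run_app(config)
        history.append((config, score))
    best_config, best_score = max(history, key=lambda item: item[1])
    # Step 6: keep a reusable note about what worked for this workload.
    knowledge.append(f"{trace_summary}: best {best_config} -> {best_score:.1f}")
    return best_config, best_score


def stub_propose(params: Bounds, trace_summary: str,
                 history: History, knowledge: List[str]) -> Config:
    """Hypothetical stand-in for the LLM agent: start low, then double the best."""
    if not history:
        return {name: low for name, (low, _high) in params.items()}
    best_config, _ = max(history, key=lambda item: item[1])
    return {name: min(value * 2, params[name][1])
            for name, value in best_config.items()}


def stub_run_app(config: Config) -> float:
    """Synthetic score peaking at stripe_count == 8; not a real benchmark."""
    return 100.0 - 10.0 * abs(config["stripe_count"] - 8)


if __name__ == "__main__":
    notes: List[str] = []
    best, score = tune({"stripe_count": (1, 16)}, "large sequential writes",
                       stub_propose, stub_run_app, notes)
    print(best, score, notes)
```

In a real system, `propose` would be where LLM reasoning, RAG over the manuals, and the accumulated `knowledge` come together, and `run_app` would launch the application on the target file system. The bounded `max_attempts` reflects the abstract's emphasis on converging within a handful of runs rather than thousands.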
