Tuning parallel file systems (PFSs) in High-Performance Computing (HPC) systems remains challenging because of complex I/O paths, diverse I/O patterns, and dynamic system conditions. Existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, but they lack scalability and adaptivity and cannot operate online. In this work, we focus on scalable online tuning and present CARAT, an ML-guided framework that co-tunes the client-side RPC and caching parameters of a PFS using only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT lets each client make independent, intelligent tuning decisions online, responding in real time to changes in both application I/O behavior and system state. We prototyped CARAT on Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results show that CARAT achieves up to 3x performance improvement over default or static configurations, validating the effectiveness and generality of our approach. Because CARAT is scalable and lightweight, we believe it can be widely deployed in existing PFSs and benefit a variety of data-intensive applications.
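To make the idea of client-local co-tuning concrete, the following is a minimal sketch and not CARAT's actual implementation. It assumes that the tuned knobs are the Lustre client parameters `max_rpcs_in_flight`, `max_pages_per_rpc`, and `max_dirty_mb` under `osc.*` (the abstract does not name specific parameters), that the only local metric is a hypothetical `avg_request_kib`, and that a toy scoring rule stands in for the learned model. The names `ClientConfig`, `toy_predictor`, `choose_config`, and `lctl_commands`, as well as all candidate values, are illustrative.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class ClientConfig:
    # Assumed knobs: real Lustre client tunables, but not confirmed as CARAT's set.
    max_rpcs_in_flight: int
    max_pages_per_rpc: int
    max_dirty_mb: int


def candidate_configs() -> List[ClientConfig]:
    """Enumerate a small, illustrative grid of joint RPC/cache settings."""
    grid = product((8, 32), (256, 1024), (32, 512))
    return [ClientConfig(r, p, d) for r, p, d in grid]


def toy_predictor(metrics: Metrics, cfg: ClientConfig) -> float:
    """Placeholder for a learned model: returns a made-up score, higher is better.

    The rule is arbitrary (large requests favor larger values, small requests
    favor smaller ones) and is not a tuning recommendation.
    """
    sign = 1.0 if metrics["avg_request_kib"] >= 1024 else -1.0
    size = (cfg.max_rpcs_in_flight / 32
            + cfg.max_pages_per_rpc / 1024
            + cfg.max_dirty_mb / 512)
    return sign * size


def choose_config(metrics: Metrics,
                  candidates: List[ClientConfig],
                  predict: Callable[[Metrics, ClientConfig], float]) -> ClientConfig:
    """Pick the candidate with the best predicted score, using local metrics only."""
    return max(candidates, key=lambda cfg: predict(metrics, cfg))


def lctl_commands(cfg: ClientConfig) -> List[str]:
    """Build (but do not run) the lctl commands that would apply a configuration."""
    return [
        f"lctl set_param osc.*.max_rpcs_in_flight={cfg.max_rpcs_in_flight}",
        f"lctl set_param osc.*.max_pages_per_rpc={cfg.max_pages_per_rpc}",
        f"lctl set_param osc.*.max_dirty_mb={cfg.max_dirty_mb}",
    ]


if __name__ == "__main__":
    local_metrics = {"avg_request_kib": 4096.0}  # hypothetical observation
    best = choose_config(local_metrics, candidate_configs(), toy_predictor)
    for cmd in lctl_commands(best):
        print(cmd)
```

Because each client scores candidates from its own observations, no cross-client coordination appears in the loop, which is the property the abstract attributes to CARAT's design. A real deployment would replace `toy_predictor` with a trained model and re-evaluate periodically as metrics change.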