AdapTBF: Decentralized Bandwidth Control via Adaptive Token Borrowing for HPC Storage

Modern high-performance computing (HPC) applications run on compute resources but share global storage systems. This design can cause problems when applications consume a disproportionate amount of storage bandwidth relative to their allocated compute resources. For example, an application running on a single compute node can issue many small, random writes and consume excessive I/O bandwidth from a storage server. This can hinder larger jobs that write to the same storage server and are allocated many compute nodes, resulting in significant resource waste. A straightforward solution is to limit each application's I/O bandwidth on storage servers in proportion to its allocated compute resources. This approach has been implemented in parallel file systems using Token Bucket Filter (TBF). However, strict proportional limits often reduce overall I/O efficiency because HPC applications generate short, bursty I/O. Limiting bandwidth can waste server capacity when applications are idle or prevent applications from temporarily using higher bandwidth during bursty phases. We argue that I/O control should maximize per-application performance and overall storage efficiency while ensuring fairness (e.g., preventing small jobs from blocking large-scale ones). We propose AdapTBF, which builds on TBF in modern parallel file systems (e.g., Lustre) and introduces a decentralized bandwidth control approach using adaptive borrowing and lending. We detail the algorithm, implement AdapTBF in Lustre, and evaluate it using synthetic workloads modeled after real-world scenarios. Results show that AdapTBF manages I/O bandwidth effectively while maintaining high storage utilization, even under extreme conditions.
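
The two ideas the abstract contrasts, strict proportional limits and borrowing of unused tokens, can be illustrated with a small sketch. This is not the AdapTBF algorithm, which the abstract does not specify, and it is not Lustre's TBF code. All names here (`Bucket`, `make_buckets`, `serve`) and the one-tick bucket capacity are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical per-job token bucket on one storage server."""
    rate: float          # tokens added per tick (the job's proportional share)
    tokens: float = 0.0  # tokens currently available
    borrowed: float = 0.0  # running total of tokens taken from other jobs
    lent: float = 0.0      # running total of tokens given to other jobs


def make_buckets(job_nodes, server_rate):
    """Split server_rate among jobs in proportion to their compute nodes."""
    total_nodes = sum(job_nodes.values())
    return {
        job: Bucket(rate=server_rate * nodes / total_nodes)
        for job, nodes in job_nodes.items()
    }


def serve(buckets, demand):
    """Serve one tick of I/O demand (job -> tokens requested).

    Returns job -> tokens granted. Assumption: a bucket holds at most
    one tick of tokens, so unused tokens do not accumulate across ticks.
    """
    # Refill every bucket, capped at one tick's worth of tokens.
    for b in buckets.values():
        b.tokens = min(b.rate, b.tokens + b.rate)

    # Step 1: each job first spends its own proportional share.
    granted = {}
    for job, b in buckets.items():
        use = min(b.tokens, demand.get(job, 0.0))
        b.tokens -= use
        granted[job] = use

    # Step 2: jobs with unmet demand borrow tokens that others left unused.
    for job, b in buckets.items():
        need = demand.get(job, 0.0) - granted[job]
        for other_job, other in buckets.items():
            if need <= 0:
                break
            if other_job == job or other.tokens <= 0:
                continue
            loan = min(need, other.tokens)
            other.tokens -= loan
            other.lent += loan
            b.borrowed += loan
            granted[job] += loan
            need -= loan
    return granted
```

With a 1-node job and a 3-node job sharing a server rate of 100 tokens per tick, the shares are 25 and 75. If the small job requests 60 while the large job is idle, this sketch grants all 60 by lending 35 of the large job's unused tokens. If both jobs are busy in the next tick, the small job is held to its share of 25 and the large job receives its full 75.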
