Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and cost requirements of data-intensive applications. The performance of an HSS depends heavily on two key data-management policies: (1) data placement, which determines the most suitable storage device on which to store application data, and (2) data migration, which dynamically reorganizes previously stored data across storage devices (i.e., prefetching hot data and evicting cold data) to sustain high HSS performance. These policies are tightly interdependent, so improving one without considering the other leads to suboptimal HSS performance. Unfortunately, prior work focuses on optimizing only one of the two policies. Our goal is to design a holistic data-management technique that optimizes both the data-placement and data-migration policies to fully exploit the potential of an HSS. To this end, we propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique. Harmonia employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, that adapt their policies to the current workload and HSS configuration while coordinating with each other. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and 25 data-intensive workloads. On a performance- (cost-) optimized HSS with two heterogeneous storage devices, Harmonia outperforms the best-performing prior approach by 29.3% (44.8%) on average. On an HSS with three (four) devices, Harmonia outperforms the best-performing prior approach by 38.9% (39.2%) on average. Harmonia's performance benefits come with low latency overhead (240 ns for inference) and low storage overhead (206 KiB in DRAM for both RL agents combined).