Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on overall system efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data centers at Google, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world deployments with frequently changing workloads. To address this problem, we introduce a cross-layer approach where workloads instead ``bring their own model''. This strategy moves ML out of the storage system and instead allows each workload to train its own lightweight model at the application layer, capturing the workload's specific characteristics. These small, interpretable models generate predictions that guide a co-designed scheduling heuristic at the storage layer, enabling adaptation to diverse online environments. We build a proof-of-concept of this approach in a production distributed computation framework at Google. Evaluations in a test deployment and large-scale simulation studies using production traces show TCO savings of as much as 3.47$\times$ compared to state-of-the-art baselines.
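To make the cross-layer split concrete, the following is a minimal sketch of the ``bring your own model'' idea, assuming (hypothetically) that the application-layer model predicts file lifetimes and the storage-layer heuristic uses those predictions to choose between SSD and HDD placement. All names, features, and thresholds here are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only: per-workload lightweight model at the
# application layer, consumed by a simple placement heuristic at the
# storage layer. Categories, features, and the 300 s cutoff are
# assumed for illustration.
from dataclasses import dataclass
from statistics import median


@dataclass
class FileWrite:
    category: str      # application-level feature, e.g. an output stage name
    lifetime_s: float  # observed file lifetime, used as the training signal


class WorkloadModel:
    """Each workload trains its own small, interpretable model; here,
    per-category median lifetimes stand in for that model."""

    def __init__(self):
        self.samples = {}

    def train(self, history):
        for w in history:
            self.samples.setdefault(w.category, []).append(w.lifetime_s)

    def predict_lifetime(self, category):
        obs = self.samples.get(category)
        # Unseen categories get a conservative "long-lived" prediction.
        return median(obs) if obs else float("inf")


def place(predicted_lifetime_s, ssd_cutoff_s=300.0):
    """Storage-layer heuristic co-designed with the model: short-lived
    files go to SSD, long-lived or unknown files go to HDD."""
    return "SSD" if predicted_lifetime_s <= ssd_cutoff_s else "HDD"


# Usage: the workload trains on its own history, then the storage layer
# only ever sees the model's predictions, not the model itself.
model = WorkloadModel()
model.train([FileWrite("shuffle", 60.0), FileWrite("shuffle", 90.0),
             FileWrite("final_output", 86400.0)])
print(place(model.predict_lifetime("shuffle")))       # short-lived -> SSD
print(place(model.predict_lifetime("final_output")))  # long-lived  -> HDD
```

The design choice this illustrates is the separation of concerns: the model lives with the workload (which can retrain as its behavior changes), while the storage layer keeps a fixed, interpretable heuristic.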