Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on overall system efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at Google, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and into the application running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at Google. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47x in TCO savings compared to state-of-the-art baselines. We believe this work represents a significant step towards more practical ML-driven storage placement in warehouse-scale computers.