With the rapid development of cloud computing and big data technologies, storage systems have become a fundamental building block of datacenters, incorporating hardware innovations such as flash solid-state drives and non-volatile memories, as well as software infrastructures such as RAID and distributed file systems. Despite the growing interest in storage, designing and implementing reliable storage systems remains challenging because of performance instability and prevalent hardware failures. Proactive prediction can greatly strengthen the reliability of storage systems. It has two dimensions: performance and failure. Ideally, by detecting slow I/O requests in advance and predicting device failures before they occur, we can build storage systems with very low tail latency and high availability. Although its importance is well recognized, proactive prediction in storage systems is particularly difficult. To move towards predictable storage systems, various mechanisms have been proposed and field studies conducted in the past few years. In this report, we survey these mechanisms and field studies, focusing on machine-learning-based black-box approaches. Based on three representative research works, we discuss where and how machine learning should be applied in this field. We also evaluate the strengths and limitations of each work in detail.