Data centers process huge volumes of data every day, made possible by the proliferation of inexpensive hard disks. Data stored on these disks serve critical needs in domains ranging from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failure, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of a hard disk drive, one can predict the time-to-failure of a particular device and replace it at the right time, maximizing utilization while reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are well suited to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from health statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested on all 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This work closes the knowledge gap on what full-scale training achieves on thousands of devices and advances the state of the art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM achieved an RMSE of 0.83 during training and 0.86 during testing over the full 10-year dataset, while generalizing competitively to other drives from the Seagate family.
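To make the prediction target and the reported metric concrete, the following is a minimal sketch of how a days-remaining sequence and an RMSE score could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the helper names `rul_labels` and `rmse` are hypothetical, the last daily record of a failed drive is assumed to be its failure day, and the abstract does not say whether the targets behind the reported 0.83 and 0.86 were scaled.

```python
import numpy as np


def rul_labels(num_days: int) -> np.ndarray:
    """Days remaining before failure for each daily record of one failed drive.

    Assumption (not stated in the abstract): the final record is the failure
    day, so its remaining useful life (RUL) is 0 and earlier days count up.
    """
    return np.arange(num_days - 1, -1, -1, dtype=float)


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between true and predicted days remaining."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical drive with five daily records before failure.
labels = rul_labels(5)  # [4., 3., 2., 1., 0.]

# Hypothetical model output that is off by exactly one day at every step.
predictions = labels + np.array([1.0, -1.0, 1.0, -1.0, 1.0])

print(rmse(labels, predictions))  # 1.0
```

In this toy case every prediction misses by one day, so the RMSE is exactly 1.0; a sequence model such as the encoder-decoder LSTM would produce the `predictions` array from a window of health statistics rather than from the labels themselves.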