Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters

Data centers process huge volumes of data every day, backed by the proliferation of inexpensive hard disks. The data stored on these disks serve a range of critical functions, from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failure, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of a hard disk drive, one can predict the time-to-failure of a particular device and replace it at the right time, maximizing utilization while reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are well suited to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from health statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested on all 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This work closes the knowledge gap on what full-scale training achieves on thousands of devices and advances the state of the art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 on the exhaustive set while generalizing competitively to other hard drives in the Seagate family.
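As an illustration of the prediction target and the evaluation metric named above, the following minimal sketch labels each daily record of a failed drive with the number of days remaining before failure and scores a predicted sequence with RMSE. It assumes the records are in date order and that the last record is the failure day; the function names `rul_labels` and `rmse` and the sample values are hypothetical and not taken from this work.

```python
import math


def rul_labels(num_days):
    """Days remaining until failure for each daily record of a failed drive.

    Assumption (not from the source): records are in date order and the
    final record is the failure day, which therefore gets a label of 0.
    """
    return [num_days - 1 - i for i in range(num_days)]


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Five daily records ending in failure give the targets 4, 3, 2, 1, 0.
labels = rul_labels(5)

# Made-up model outputs, for illustration only.
predictions = [4.5, 3.0, 2.5, 1.0, 0.0]
error = rmse(predictions, labels)
```

In a sequence-to-sequence setting, a decoder would emit one such remaining-days value per output step, and the metric would be computed over all predicted steps.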
