Hard disk drive (HDD) failures in datacenters are costly: the consequences range from catastrophic data loss to lost goodwill, and stakeholders want to avoid them at all costs. An important tool for proactively guarding against HDD failure is timely estimation of the Remaining Useful Life (RUL). To this end, the Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) built into HDDs provides critical logs for the long-term maintenance of the security and dependability of these essential data storage devices. Past data-driven predictive models have relied heavily on these S.M.A.R.T. logs together with CNN- and RNN-based architectures. However, they have struggled both to provide a confidence interval around the predicted RUL values and to process very long sequences of logs. In addition, some of these approaches, such as those based on LSTMs, are inherently slow to train and carry tedious feature-engineering overheads. To overcome these challenges, we propose a novel transformer architecture, the Temporal-fusion Bi-encoder Self-attention Transformer (TFBEST), for predicting failures in hard drives. It is an encoder-decoder deep learning technique that enhances the context gained from sequences of health statistics and predicts a sequence of the number of days remaining before a disk potentially fails. We also provide a novel confidence margin statistic that can help manufacturers replace a hard drive within a time frame. Experiments on Seagate HDD data show that our method significantly outperforms state-of-the-art RUL prediction methods when tested on the exhaustive 10-year data from Backblaze (2013-present). Although validated on HDD failure prediction, the TFBEST architecture is well suited to other prognostics applications and may be adapted for allied regression problems.
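To make the prediction target concrete, the following is a minimal sketch of how a sequence of RUL values could be derived from a failed drive's daily logs and paired with windows of health statistics. It is an illustration under stated assumptions, not the preprocessing used for TFBEST: the function names (`rul_labels`, `make_windows`), the window length, and the toy feature values are all hypothetical.

```python
from datetime import date, timedelta


def rul_labels(log_dates, failure_date):
    """Return the days remaining until failure for each logged day.

    Assumption: RUL is counted in whole days and is 0 on the failure day.
    """
    return [(failure_date - d).days for d in log_dates]


def make_windows(features, labels, window):
    """Pair each run of `window` consecutive daily feature rows with the
    RUL values for the same days (sequence in, sequence out)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    return [
        (features[i:i + window], labels[i:i + window])
        for i in range(len(features) - window + 1)
    ]


# Toy example: five daily log entries for one drive that fails on the last day.
start = date(2020, 1, 1)
log_dates = [start + timedelta(days=i) for i in range(5)]
failure_date = log_dates[-1]

# Made-up stand-ins for two S.M.A.R.T. attributes per day.
features = [[5, 0], [5, 0], [5, 1], [6, 3], [9, 8]]

labels = rul_labels(log_dates, failure_date)
windows = make_windows(features, labels, window=3)
```

Each element of `windows` holds a short sequence of daily health statistics and the matching sequence of days remaining, which is the shape of input and target that a sequence-to-sequence RUL model consumes.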