In large-scale datacenters, memory failures are a common cause of server crashes, and Uncorrectable Errors (UEs) are a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily predict UEs from Correctable Errors (CEs) without fully exploiting the information carried by error bits, even though error bit patterns correlate strongly with UE occurrence. In this paper, we present a comprehensive study of the correlation between CEs and UEs, with emphasis on spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Evaluations on real-world datasets show that our approach improves prediction performance by 15% in F1-score compared with state-of-the-art algorithms. Overall, our approach reduces the number of virtual machine interruptions caused by UEs by approximately 59%.