In large-scale datacenters, memory failure is a common cause of server crashes, and uncorrectable errors (UEs) are a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily predict UEs from correctable errors (CEs) without fully considering the information carried by error bits, even though error bit patterns correlate strongly with UE occurrence. In this paper, we present a comprehensive study of the correlation between CEs and UEs, with emphasis on spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Evaluations on real-world datasets demonstrate that our approach improves prediction performance by 15% in F1-score over state-of-the-art algorithms. Overall, our approach reduces the number of virtual machine interruptions caused by UEs by approximately 59%.