This paper is a corrigendum to the paper by Beigi et al. published at HPCA 2023 (https://doi.org/10.1109/HPCA56546.2023.10071066). The HPCA paper presented a detailed field data analysis of faults observed at scale in DDR4 DRAM from two different memory vendors, including a breakdown of fault patterns, or modes. Upon further study of the data, we found a bug in how we decoded errors from the logged row-bank-column address: some errors that occurred in one column were misinterpreted as occurring in two non-adjacent columns. As a result, some single-bit faults were misclassified as partial-row faults (i.e., two-bit faults), and some single-column faults were misclassified as two-column faults. Consequently, the proportion of single-bit faults is higher than reported in the original paper, with a commensurate reduction in the fraction of certain types of multi-bit faults. These misclassifications also slightly change the Failure In Time (FIT) per DRAM device values presented in the original paper. In this corrigendum, we provide updated versions of the relevant tables and figures and identify the page numbers and references in the original paper that they replace.