Spin-Transfer Torque Magnetic RAM (STT-MRAM), one of the most promising replacements for SRAM in on-chip cache memories, offers higher density and scalability, near-zero leakage power, and non-volatility, but its reliability is threatened by a high read disturbance error rate. Error-Correcting Codes (ECCs) are conventionally suggested to overcome read disturbance errors in STT-MRAM caches. By employing aggressive ECCs and checking a cache block on every read access, a high level of cache reliability is achieved. However, to minimize cache access time in modern processors, all blocks in the target cache set are read in parallel for the tag comparison, and only the requested block, if any, is sent out after its ECC is checked. These extra block reads, whose ECCs are not checked until the processor requests the blocks, cause read disturbance errors to accumulate, which significantly degrades cache reliability. In this paper, we first introduce and formulate the read disturbance accumulation phenomenon and reveal that this accumulation, caused by conventional parallel accesses of cache blocks, significantly increases the cache error rate. Then, we propose a simple yet effective scheme, called Read Error Accumulation Preventer cache (REAP-cache), to completely eliminate the accumulation of read disturbances without compromising cache performance. Our evaluations show that the proposed REAP-cache extends the cache Mean Time To Failure (MTTF) by 171x, while increasing the cache area by less than 1% and energy consumption by only 2.7%.
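To make the accumulation effect concrete, the following is a minimal sketch of a simplified model, not the formulation used in the paper. It assumes that each read disturbs each bit independently with a fixed probability, that a disturbed bit stays flipped until the block is checked and corrected, and that the ECC corrects up to a fixed number of bit errors per block. The function name, its parameters, and all numeric values are hypothetical and chosen only for illustration.

```python
from math import comb, expm1, log1p


def block_failure_probability(n_bits, p_disturb, t_correctable, unchecked_reads):
    """Probability that a block holds more than `t_correctable` disturbed bits
    at the moment its ECC is finally checked.

    Assumptions (illustrative, not taken from the paper):
      * each read disturbs each bit independently with probability p_disturb;
      * a disturbed bit stays flipped until the block is checked and corrected;
      * the ECC corrects up to t_correctable bit errors per block.
    """
    # Per-bit probability of being disturbed at least once over all unchecked
    # reads: 1 - (1 - p)^N, computed in a numerically stable way.
    q = -expm1(unchecked_reads * log1p(-p_disturb))

    # Sum the binomial tail directly (k > t) to avoid the cancellation that
    # "1 - P(at most t errors)" suffers when the failure probability is tiny.
    return sum(
        comb(n_bits, k) * q**k * (1.0 - q) ** (n_bits - k)
        for k in range(t_correctable + 1, n_bits + 1)
    )


if __name__ == "__main__":
    # Hypothetical values: 512-bit block, single-error-correcting code.
    checked_every_read = block_failure_probability(512, 1e-8, 1, 1)
    accumulated = block_failure_probability(512, 1e-8, 1, 64)
    print(f"checked on every read : {checked_every_read:.3e}")
    print(f"64 unchecked reads    : {accumulated:.3e}")
```

Under these assumptions, the failure probability grows with the number of reads a block receives before its ECC is checked, which is the qualitative behaviour the abstract attributes to parallel set accesses. The model ignores the data dependence of read disturbance and any correlation between cells, so it should be read only as an intuition aid.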