Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is not fully reliable and is often computationally expensive. Beyond-bound decoding is an important problem for error-correcting Dynamic Random Access Memory (DRAM). These memories are often designed so that each access touches two extra memory devices, which allows a failure in any one device to be corrected. But system architectures increasingly require DRAM to store metadata in addition to user data. When the metadata replaces parity data, a single-device failure becomes a beyond-bound error. An error-correction system can either protect each access with a single RS code, or divide it into several segments that are each protected with a shorter code, usually in an Interleaved Reed-Solomon (IRS) configuration. The full-block RS approach is more reliable, both at correcting errors and at preventing silent data corruption (SDC). The IRS option is faster, and is especially efficient at beyond-bound correction of single- or double-device failures. Here we describe a new family of "unraveling" Reed-Solomon codes that bridges the gap between these options. Our codes are full-block generalized RS codes, but they can also be decoded using an IRS decoder. As a result, they combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices. We show that unraveling codes are an especially good fit for high-reliability DRAM error correction.
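As a rough illustration of the bound described above, the following Python sketch checks whether a whole-device failure stays within the guaranteed correction radius $t$ of a code with $2t$ parity symbols. The layout numbers (4 symbols per device, 2 symbols of metadata) and the function names are hypothetical and chosen only for this example; they are not taken from any specific DRAM configuration or from the codes described here.

```python
def guaranteed_correctable(parity_symbols: int) -> int:
    """Symbol errors correctable within the bound: t = floor(parity / 2)."""
    return parity_symbols // 2


def device_failure_within_bound(symbols_per_device: int, parity_symbols: int) -> bool:
    """A failed device can corrupt every symbol it contributes to the block.

    The failure is within the bound only if that symbol count is at most t.
    """
    return symbols_per_device <= guaranteed_correctable(parity_symbols)


# Hypothetical layout: each device contributes 4 symbols per access.
SYMBOLS_PER_DEVICE = 4

# Two extra devices used entirely for parity: 8 parity symbols, so t = 4.
full_parity = 2 * SYMBOLS_PER_DEVICE
print(device_failure_within_bound(SYMBOLS_PER_DEVICE, full_parity))

# Assume 2 of those symbols now hold metadata: 6 parity symbols, so t = 3.
reduced_parity = full_parity - 2
print(device_failure_within_bound(SYMBOLS_PER_DEVICE, reduced_parity))
```

Under these assumed numbers, the first check succeeds and the second fails: once metadata displaces some of the parity, correcting a single-device failure requires beyond-bound decoding.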