As DRAM scales to higher densities and I/O speeds, ensuring data correctness becomes increasingly difficult. Industry has responded with a three-layer stack: on-die ECC (O-ECC), link ECC (L-ECC), and system ECC (S-ECC). However, these layers have evolved independently, so they often duplicate redundancy, leave coverage gaps, and occasionally interfere with one another. We propose Cerberus, a cross-layer ECC co-design that unifies protection across device, link, and system while preserving the native role of each layer. At its core is an Encode-Once, Decode-Many (EODM) architecture: the controller performs a single encoding whose redundancy is reused by L-ECC for immediate write-path detection and retry, by O-ECC for in-device repair on reads, and by S-ECC for strong end-to-end recovery. Cerberus jointly designs complementary parity and syndrome structures, orders the decoders, and allocates the correction budget to prevent miscorrection amplification and to enable selective correction under tight redundancy constraints. Our evaluations show improved resilience to clustered and peripheral faults while reducing redundant overhead, underscoring the importance of coordinated cross-layer protection for next-generation memory systems such as custom HBM.
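To make the Encode-Once, Decode-Many idea concrete, the following is a minimal toy sketch and not the Cerberus construction itself. It assumes an extended Hamming(8,4) SECDED code as a stand-in for the actual code design, and the function names (`encode`, `link_check`, `on_die_repair`, `system_decode`) are hypothetical. It only illustrates how one controller-side encoding can be consumed by three decoders with different policies: detect-only on the link, single-bit repair in the device, and a final end-to-end check at the system level.

```python
# Toy illustration only: extended Hamming(8,4) stands in for the real code,
# and all function names are hypothetical.

def encode(nibble):
    """Controller: encode 4 data bits once into an 8-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    cw = [0] * 8                      # index 0: overall parity; 1..7: Hamming positions
    cw[3], cw[5], cw[6], cw[7] = d    # data bits sit at non-power-of-two positions
    cw[1] = cw[3] ^ cw[5] ^ cw[7]     # parity over positions with bit 0 set
    cw[2] = cw[3] ^ cw[6] ^ cw[7]     # parity over positions with bit 1 set
    cw[4] = cw[5] ^ cw[6] ^ cw[7]     # parity over positions with bit 2 set
    cw[0] = sum(cw[1:]) % 2           # makes the whole codeword even parity
    return cw

def syndrome(cw):
    """Shared syndrome logic: (Hamming syndrome 0..7, overall parity 0/1)."""
    s = 0
    for pos in range(1, 8):
        if cw[pos]:
            s ^= pos
    return s, sum(cw) % 2

def link_check(cw):
    """L-ECC role: detect only; False would trigger a write retry."""
    return syndrome(cw) == (0, 0)

def on_die_repair(cw):
    """O-ECC role: repair a single flipped bit, leave anything else untouched."""
    s, p = syndrome(cw)
    fixed = list(cw)
    if p == 1:                        # odd parity: assume exactly one flipped bit
        fixed[s] ^= 1                 # s == 0 means the overall parity bit itself
    return fixed

def system_decode(cw):
    """S-ECC role (toy): final check; returns (data, ok)."""
    data = cw[3] | (cw[5] << 1) | (cw[6] << 2) | (cw[7] << 3)
    return data, syndrome(cw) == (0, 0)

cw = encode(0b1011)
bad = list(cw)
bad[6] ^= 1                           # inject a single-bit fault
assert link_check(cw) and not link_check(bad)
assert system_decode(on_die_repair(bad)) == (0b1011, True)
```

In this toy, a double-bit fault yields even overall parity with a nonzero syndrome, so `on_die_repair` leaves the word untouched and `system_decode` reports it as not ok. An odd-weight fault of three or more bits could instead be miscorrected by the in-device step, which is the kind of hazard that coordinated decoder ordering and correction budgeting are meant to address.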