As DRAM scales to higher densities and I/O speeds, ensuring data correctness becomes increasingly difficult. Industry has responded with a three-layer protection stack: on-die ECC (O-ECC), link ECC (L-ECC), and system ECC (S-ECC). However, these layers have evolved independently, often duplicating redundancy, leaving coverage gaps, and occasionally interfering with one another. We propose Cerberus, a cross-layer ECC co-design that unifies protection across the device, link, and system layers while preserving the native role of each. At its core is an Encode-Once, Decode-Many (EODM) architecture: the controller performs a single encoding whose redundancy is reused by L-ECC for immediate write-path error detection and retry, by O-ECC for in-device repair on reads, and by S-ECC for strong end-to-end recovery. Cerberus jointly designs complementary parity and syndrome structures, orders the decoders, and allocates the correction budget among them to prevent miscorrection amplification and enable selective correction under tight redundancy constraints. Our evaluation shows improved resilience to clustered and peripheral faults with lower redundancy overhead, underscoring the importance of coordinated cross-layer protection for next-generation memory systems such as custom HBM.
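To make the Encode-Once, Decode-Many idea concrete, the following is a minimal toy sketch, not the Cerberus code construction itself, which the abstract does not specify. It assumes an extended Hamming(8,4) code as a stand-in, and the function names (`encode`, `link_needs_retry`, `on_die_repair`, `system_decode`) are hypothetical. A single encoding is consumed by three decoders with different policies: detect-only on the link, single-bit repair on the die, and a system-level decode that refuses to correct when the syndrome indicates a double-bit error.

```python
# Toy illustration only: extended Hamming(8,4) is an assumed stand-in code,
# and all names here are hypothetical, not taken from the Cerberus design.

def encode(nibble):
    """Encode 4 data bits ONCE into an 8-bit codeword (list of bits).
    cw[1..7] is a Hamming(7,4) word; cw[0] is an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    cw = [0] * 8
    cw[3], cw[5], cw[6], cw[7] = d
    cw[1] = cw[3] ^ cw[5] ^ cw[7]
    cw[2] = cw[3] ^ cw[6] ^ cw[7]
    cw[4] = cw[5] ^ cw[6] ^ cw[7]
    cw[0] = sum(cw[1:]) % 2
    return cw

def _syndrome(cw):
    """Return (Hamming syndrome, overall parity); both are 0 for a valid word."""
    s = 0
    for pos in range(1, 8):
        if cw[pos]:
            s ^= pos
    return s, sum(cw) % 2

def link_needs_retry(cw):
    """Link-layer view: detect only, never correct; any mismatch asks for a retry."""
    s, p = _syndrome(cw)
    return s != 0 or p != 0

def on_die_repair(cw):
    """Device view: repair when the overall parity is odd (treated as a
    single-bit error); leave every other pattern untouched."""
    s, p = _syndrome(cw)
    fixed = list(cw)
    if p == 1:
        fixed[s] ^= 1  # s == 0 means the overall parity bit itself flipped
    return fixed

def system_decode(cw):
    """System view: return (data, status); declines to correct double-bit errors."""
    s, p = _syndrome(cw)
    if s == 0 and p == 0:
        status = "ok"
    elif p == 1:
        cw, status = on_die_repair(cw), "corrected"
    else:
        return None, "uncorrectable"
    data = cw[3] | cw[5] << 1 | cw[6] << 2 | cw[7] << 3
    return data, status

if __name__ == "__main__":
    word = encode(0b1011)
    single = list(word); single[5] ^= 1
    double = list(single); double[6] ^= 1
    print(link_needs_retry(single), on_die_repair(single) == word)
    print(system_decode(single), system_decode(double))
```

In this toy, the on-die decoder leaves a double-bit error untouched rather than guessing, so the system-level decoder still sees the original syndrome and can flag the word as uncorrectable. This is only a loose analogue of the miscorrection-amplification concern raised above.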