Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct operation depends strongly on the model parameters stored in memory, which are typically protected with Error Correction Codes (ECCs). In this work, we study the impact of ECC on such models and propose two lightweight alternatives to ECC that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in the parameters of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, in most cases outperforming conventional Single Error Correction Double Error Detection (SECDED) ECC schemes, with no memory overhead and with considerably lower area and delay than SECDED. The results also indicate that ViTs can be effectively protected by hardening only the highest exponent bit of their parameters in the FP16 and FP32 representations. Furthermore, applying the CEP technique keeps DNNs resilient at bit error rates (BERs) up to one order of magnitude higher, with 3.5x lower area overhead and a 7x faster decoder compared to SECDED ECC.
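To illustrate why the highest exponent bit is a natural target for selective hardening, the following minimal Python sketch flips single bits in the IEEE 754 binary32 (FP32) encoding of a small weight. It is not the MSET or CEP implementation; the helper name `flip_bit_fp32` and the example weight are assumptions chosen only for illustration.

```python
import struct


def flip_bit_fp32(value: float, bit: int) -> float:
    """Flip one bit of `value` encoded as IEEE 754 binary32.

    Bit 31 is the sign, bits 30..23 are the exponent, bits 22..0 the mantissa.
    (Hypothetical helper for illustration; not part of the paper's method.)
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit index must be in [0, 31]")
    # Reinterpret the float as a 32-bit unsigned integer.
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    # Inject the fault by toggling the selected bit.
    word ^= 1 << bit
    # Reinterpret the faulty word as a float again.
    (faulty,) = struct.unpack("<f", struct.pack("<I", word))
    return faulty


weight = 0.5  # assumed example; typical DNN weights have small magnitude

# A flip in the highest exponent bit (bit 30) turns 0.5 into 2**127.
print(flip_bit_fp32(weight, 30))

# A flip in the lowest mantissa bit (bit 0) changes the value by only 2**-24.
print(flip_bit_fp32(weight, 0))
```

For weights with magnitude below 2, the highest exponent bit is 0 in the fault-free encoding, so a single flip there inflates the value by many orders of magnitude, whereas mantissa flips perturb it only slightly.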