Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct operation depends strongly on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study the impact of ECC on such models and propose two lightweight alternatives to ECC that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, in most cases outperforming conventional Single Error Correction Double Error Detection (SECDED) ECC schemes, with no memory overhead and with considerably lower area and delay than SECDED. The results also indicate that ViTs can be effectively protected by hardening only their highest exponent bits in FP16 and FP32 representations. Furthermore, CEP keeps DNNs resilient at bit error rates (BERs) up to one order of magnitude higher, with a 3.5x lower area overhead and a 7x faster decoder, compared to SECDED ECC.
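To illustrate why the highest exponent bit is a natural target for selective protection, the following minimal sketch flips individual bits in the FP16 encoding of a single weight and prints the corrupted value. It is not the MSET or CEP implementation, whose details the abstract does not give. The helper name `flip_bit_fp16` and the example weight 0.5 are assumptions chosen for illustration only.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Round `value` to FP16, flip one bit of its 16-bit encoding, and decode it.

    Bit 15 is the sign, bits 14..10 are the exponent, bits 9..0 the mantissa.
    """
    # Encode as IEEE 754 half precision and read the raw 16-bit pattern.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    raw ^= 1 << bit  # inject a single-bit fault
    (corrupted,) = struct.unpack("<e", struct.pack("<H", raw))
    return corrupted


weight = 0.5  # hypothetical parameter value, small in magnitude like most DNN weights

for bit, name in [(15, "sign"), (14, "highest exponent"),
                  (10, "lowest exponent"), (0, "lowest mantissa")]:
    print(f"bit {bit:2d} ({name}): {weight} -> {flip_bit_fp16(weight, bit)}")
```

In this example, a flip in the highest exponent bit turns 0.5 into 32768.0, whereas a flip in the lowest mantissa bit changes the value by less than 0.001. For weights with magnitude below 2, the highest exponent bit is always 0 in a fault-free encoding, so a single upset there produces a very large outlier.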

