When Small Variations Become Big Failures: Reliability Challenges in Compute-in-Memory Neural Accelerators

Compute-in-memory (CiM) architectures promise significant improvements in energy efficiency and throughput for deep neural network acceleration by alleviating the von Neumann bottleneck. However, their reliance on emerging non-volatile memory devices introduces device-level non-idealities, such as write variability, conductance drift, and stochastic noise, that fundamentally challenge reliability, predictability, and safety, especially in safety-critical applications. This talk examines the reliability limits of CiM-based neural accelerators and presents a series of techniques that bridge device physics, architecture, and learning algorithms to address these challenges. We first demonstrate that even small device variations can lead to disproportionately large accuracy degradation and catastrophic failures in safety-critical inference workloads, revealing a critical gap between average-case evaluations and worst-case behavior. Building on this insight, we introduce SWIM, a selective write-verify mechanism that strategically applies verification only where it is most impactful, significantly improving reliability while maintaining CiM's efficiency advantages. Finally, we explore a learning-centric solution that improves realistic worst-case performance by training neural networks with right-censored Gaussian noise, aligning training assumptions with hardware-induced variability and enabling robust deployment without excessive hardware overhead. Together, these works highlight the necessity of cross-layer co-design for CiM accelerators and provide a principled path toward dependable, efficient neural inference on emerging memory technologies, paving the way for their adoption in safety- and reliability-critical systems.
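
To make the phrase "right-censored Gaussian noise" concrete, the following is a minimal sketch of one way such noise could be sampled and added to a weight tensor during training. It is an illustration only, not the speakers' implementation: the function names `right_censored_gaussian` and `perturb_weights`, the values of `sigma` and `cutoff`, and the choice to add the noise directly to the weights are all assumptions made for this example. Censoring here means that draws above the cutoff are clamped to the cutoff rather than resampled, so a point mass accumulates at the cutoff value.

```python
import numpy as np


def right_censored_gaussian(shape, sigma, cutoff, rng):
    """Draw N(0, sigma^2) samples and clamp every value above `cutoff` to `cutoff`.

    Clamping (censoring) differs from truncation: out-of-range draws are kept
    and moved to the boundary instead of being rejected and redrawn.
    """
    noise = rng.normal(loc=0.0, scale=sigma, size=shape)
    return np.minimum(noise, cutoff)


def perturb_weights(weights, sigma, cutoff, rng):
    """Return a noisy copy of `weights` (hypothetical helper for a training step)."""
    return weights + right_censored_gaussian(weights.shape, sigma, cutoff, rng)


# Illustrative values only; real settings would depend on the target device.
rng = np.random.default_rng(0)
weights = np.zeros((1000, 100))
noisy = perturb_weights(weights, sigma=0.1, cutoff=0.1, rng=rng)

# Fraction of entries sitting exactly at the cutoff (the censored point mass).
censored_fraction = float((noisy == 0.1).mean())
```

With the cutoff set to one standard deviation, as in this example, roughly the upper Gaussian tail beyond one sigma ends up at the cutoff, while the lower tail is left untouched. In a training loop, a fresh noisy copy of the weights would typically be drawn for each forward pass.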
