As CMOS technology scales into the nanometer regime, aging effects and process variations have become increasingly pronounced, posing significant reliability challenges for AI accelerators. Traditional guardband-based design approaches, which rely on pessimistic timing margins, sacrifice substantial performance and computational efficiency, rendering them inadequate for high-performance AI computing demands. Current reliability-aware AI accelerator design faces two core challenges: (1) the lack of systematic cross-layer analysis tools to capture coupled reliability effects across the device, circuit, architecture, and application layers; and (2) the fundamental trade-off between conventional reliability optimization and computational efficiency. To address these challenges, this paper systematically presents a series of reliability-aware accelerator designs, encompassing (1) an aging- and variation-aware dynamic timing analyzer, (2) accelerator dataflow optimization via critical input pattern reduction, and (3) resilience characterization and novel architecture design for large language models (LLMs). By tightly integrating cross-layer reliability modeling with AI workload characteristics, these co-optimization approaches effectively achieve reliable and efficient AI acceleration.