Progress in artificial intelligence and machine learning over the past decade has been driven by the ability to train larger deep neural networks (DNNs), leading to a compute demand that far exceeds the growth in hardware performance afforded by Moore's law. Training DNNs is an extremely memory-intensive process, requiring storage not only of the model weights but also of the activations and gradients for an entire minibatch. The need to provide high-density and low-leakage on-chip memory motivates the exploration of emerging non-volatile memory for training accelerators. Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators, including 3-4x higher density than SRAM, significantly reduced leakage power, high endurance and reasonable access time. On the other hand, MRAM write operations require high write energy and latency due to the need to ensure reliable switching. In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devise a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN accelerator. To address the inefficiency of writes in STT-MRAM, we propose to reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency trade-off, we conduct a thorough analysis of the error tolerance of input activations, weights, and errors during training. We propose heterogeneous memory configurations that enable training convergence with good accuracy. We show that MRAM provides up to 15-22x improvement in system-level energy across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy with minimal degradation in application-level training accuracy.
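The error-tolerance analysis described in the abstract can be illustrated with a small fault-injection sketch. The abstract does not specify an error model, so everything here is an assumption made for illustration: stored values are treated as unsigned 8-bit words, each bit is assumed to fail independently with a fixed probability, and the names `inject_write_errors`, `bit_error_rate`, and `bits_per_word` are hypothetical rather than taken from the authors' framework.

```python
import random


def inject_write_errors(words, bit_error_rate, bits_per_word=8, seed=0):
    """Return a copy of `words` with random bit flips applied.

    Assumption (not from the paper): each stored bit flips independently
    with probability `bit_error_rate`, as a stand-in for unreliable
    low-voltage or short-pulse writes.
    """
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    corrupted = []
    for word in words:
        mask = 0
        for bit in range(bits_per_word):
            # Decide independently for every bit position whether it flips.
            if rng.random() < bit_error_rate:
                mask |= 1 << bit
        corrupted.append(word ^ mask)  # XOR flips exactly the masked bits
    return corrupted


# Hypothetical 8-bit activation values written to the scratchpad.
activations = [0, 255, 18]
noisy = inject_write_errors(activations, bit_error_rate=0.01)
```

In a study of this kind, one might apply such a function separately to activations, weights, and backpropagated errors, sweep the error rate, and observe where training accuracy begins to degrade. That sensitivity is what would motivate giving different data types different write settings.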