System and Design Technology Co-optimization of SOT-MRAM for High-Performance AI Accelerator Memory System

SoCs are now designed with a dedicated AI accelerator segment to meet the ever-increasing demand of Deep Learning (DL) applications. With powerful multiply-accumulate (MAC) engines for matrix multiplication, these accelerators deliver high computing performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth that moves AI accelerators from memory-bound operation to system-level peak performance. We develop the memory system with customized SOT-MRAM, enabled by design-technology co-optimization (DTCO), as large on-chip memory, guided by system-technology co-optimization (STCO) and detailed characterization of the DL workloads. We evaluate our workload-aware memory system on Computer Vision (CV) and Natural Language Processing (NLP) benchmarks and observe significant power, performance, and area (PPA) improvement compared to an SRAM-based memory system in both inference and training modes. Our workload-aware memory system achieves 8X energy and 9X latency improvement on CV benchmarks in training and 8X energy and 4.5X latency improvement on NLP benchmarks in training, while consuming only around 50% of SRAM area at iso-capacity.
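The memory-bound versus peak-performance distinction in the abstract can be illustrated with the standard roofline bound, in which attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is only an illustration and is not the paper's evaluation method; the function names and all numbers (peak throughput, bandwidth, operations per byte) are hypothetical assumptions chosen for readability.

```python
def attainable_throughput(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def is_memory_bound(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """True when the memory system, not the MAC array, limits throughput."""
    return bandwidth_bytes_per_s * ops_per_byte < peak_ops_per_s


# Hypothetical accelerator: all values are assumptions, not from the paper.
PEAK = 100 * 10**12          # 100 Tera-ops/s of MAC throughput
LOW_BW = 1 * 10**12          # 1 TB/s memory bandwidth
HIGH_BW = 10 * 10**12        # 10 TB/s memory bandwidth
INTENSITY = 20               # operations performed per byte moved

low = attainable_throughput(PEAK, LOW_BW, INTENSITY)
high = attainable_throughput(PEAK, HIGH_BW, INTENSITY)

print(is_memory_bound(PEAK, LOW_BW, INTENSITY), low)    # memory-bound case
print(is_memory_bound(PEAK, HIGH_BW, INTENSITY), high)  # compute-bound case
```

With these assumed numbers, the lower-bandwidth configuration is capped by memory at a fifth of peak, while the higher-bandwidth configuration reaches the compute roof, which is the shift the abstract describes.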
