Systems-on-chip (SoCs) now integrate dedicated AI accelerators to meet the growing demand of Deep Learning (DL) applications. With powerful multiply-accumulate (MAC) engines for matrix multiplication, these accelerators deliver high compute performance. However, limited memory resources (i.e., bandwidth and capacity) prevent them from reaching optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth that moves AI accelerators out of the memory-bound regime and toward system-level peak performance. We build the memory system around a customized spin-orbit torque MRAM (SOT-MRAM), enabled by design-technology co-optimization (DTCO), as a large on-chip memory, guided by system-technology co-optimization (STCO) and detailed characterization of DL workloads. Compared with an SRAM-based memory system, our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks in training, and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks in training, while consuming only around 50% of the SRAM area at iso-capacity.