Compute-in-memory (CiM) has emerged as a highly energy-efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute into memory raises key questions: (1) What type of CiM to use: CiM designs vary widely in their characteristics, so their suitability must be assessed from an architecture perspective. (2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. (3) Where to integrate CiM: each memory level has a different bandwidth and capacity, creating different data reuse opportunities for CiM integration. To answer these questions about on-chip CiM integration for accelerating ML workloads, we use an analytical architecture evaluation methodology in which we tailor the dataflow mapping. The mapping algorithm aims to achieve the highest weight reuse and to reduce data movement for a given CiM prototype and workload. Our experiments show that CiM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to a tensor-core-like baseline architecture, with INT-8 precision under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
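To make the weight-reuse objective concrete, the following is a minimal sketch of a first-order, weight-stationary reuse estimate for a single matrix multiplication mapped onto one CiM array. It is not the mapping algorithm evaluated in this work. The names `Gemm`, `CimArray`, and `weight_stationary_stats`, and the simple tiling model behind them, are hypothetical and introduced only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Gemm:
    """Hypothetical workload: (m x k) inputs times (k x n) weights."""
    m: int  # rows of the input (activation) matrix
    k: int  # shared reduction dimension
    n: int  # columns of the weight matrix


@dataclass(frozen=True)
class CimArray:
    """Hypothetical CiM array that holds a rows x cols tile of weights."""
    rows: int  # weight rows stored in the array (mapped to k)
    cols: int  # weight columns stored in the array (mapped to n)


def weight_stationary_stats(g: Gemm, a: CimArray) -> dict:
    """First-order counts, assuming each weight is written to the array once
    and every input row is streamed past every weight tile."""
    k_tiles = math.ceil(g.k / a.rows)
    n_tiles = math.ceil(g.n / a.cols)
    weight_tiles = k_tiles * n_tiles
    weight_writes = g.k * g.n           # every weight is loaded exactly once
    input_reads = g.m * g.k * n_tiles   # inputs are re-streamed per column tile
    macs = g.m * g.k * g.n
    return {
        "weight_tiles": weight_tiles,
        "weight_writes": weight_writes,
        "input_reads": input_reads,
        # Multiply-accumulates performed per weight written (equals m here).
        "macs_per_weight_write": macs / weight_writes,
        # Fraction of allocated array cells that hold real weights.
        "utilization": weight_writes / (weight_tiles * a.rows * a.cols),
    }


if __name__ == "__main__":
    array = CimArray(rows=128, cols=128)
    print(weight_stationary_stats(Gemm(m=64, k=256, n=512), array))
    print(weight_stationary_stats(Gemm(m=1, k=256, n=512), array))
```

Under these assumptions, reuse per stored weight equals the input dimension `m`, so a matrix-vector workload (`m = 1`) amortizes each weight write over a single multiply-accumulate. This is one simple way to see why the benefit of CiM depends on the workload shape.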