Compute-in-memory (CiM) has emerged as a compelling solution to alleviate the high data movement costs of von Neumann machines. CiM can perform massively parallel general matrix multiplication (GEMM), the dominant computation in machine learning (ML) inference, directly in memory. However, repurposing memory for compute raises three key questions. 1) What type of CiM to use: given the multitude of analog and digital CiMs, their suitability must be determined from a systems perspective. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) Where to integrate CiM: each memory level has a different bandwidth and capacity, which affects the data movement and locality benefits of CiM integration. In this paper, we explore answers to these questions for CiM integration in ML inference acceleration. We use Timeloop-Accelergy for early system-level evaluation of CiM prototypes, including both analog and digital primitives. We integrate CiM into different cache memory levels of an Nvidia A100-like baseline architecture and tailor the dataflow to various ML workloads. Our experiments show that CiM architectures improve energy efficiency, achieving up to 0.12x lower energy than the established baseline at INT-8 precision, and up to 4x performance gains with weight interleaving and duplication. This work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for GEMM acceleration.
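The "when to use CiM" question turns on how memory-bound a given GEMM is. As a rough illustration only, the sketch below computes arithmetic intensity (operations per byte moved) for a GEMM shape. The abstract does not define this metric or this function; the name `gemm_arithmetic_intensity`, the one-byte (INT-8) element size for all three matrices, and the assumption that each matrix is moved exactly once are all illustrative simplifications, not the paper's method.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=1):
    """Operations per byte for C[m, n] = A[m, k] @ B[k, n].

    Hypothetical helper: assumes every matrix is moved exactly once and
    that inputs and outputs share one element size (1 byte for INT-8).
    """
    ops = 2 * m * n * k                       # one multiply and one add per MAC
    elements_moved = m * k + k * n + m * n    # A, B, and C
    return ops / (bytes_per_element * elements_moved)


# A matrix-vector-like shape reuses each weight only once, so it is memory-bound.
skinny = gemm_arithmetic_intensity(1, 1024, 1024)

# A large square GEMM reuses each operand many times, so it is compute-bound.
square = gemm_arithmetic_intensity(1024, 1024, 1024)

print(f"skinny GEMM: {skinny:.2f} ops/byte")
print(f"square GEMM: {square:.2f} ops/byte")
```

Under these assumptions, low-intensity shapes are the ones where data movement dominates, which is the regime where computing inside the memory hierarchy would be expected to matter most; a full evaluation would depend on the actual dataflow and memory level.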