As an emerging type of AI computing accelerator, SRAM Computing-In-Memory (CIM) accelerators feature high energy efficiency and throughput. However, the diversity of CIM designs and under-explored mapping strategies impede a full exploration of the compute-storage balance in SRAM-CIM accelerators, potentially leading to significant performance degradation. To address this issue, we propose CIM-Tuner, an automatic tool that derives a balanced hardware configuration and an optimal mapping strategy under an area constraint via hardware-mapping co-exploration. It ensures generality across diverse CIM designs through a matrix abstraction of CIM macros and a generalized accelerator template. For efficient mapping across different hardware configurations, it employs fine-grained two-level strategies comprising accelerator-level scheduling and macro-level tiling. Compared to prior CIM mapping approaches, CIM-Tuner's extended strategy space achieves 1.58$\times$ higher energy efficiency and 2.11$\times$ higher throughput. Applied to SOTA CIM accelerators under an identical area budget, CIM-Tuner delivers comparable improvements. The simulation accuracy is silicon-verified, and the CIM-Tuner tool is open-sourced at https://github.com/champloo2878/CIM-Tuner.git.