The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as tiled computing-in-memory (CIM) architectures based on resistive random access memory (RRAM). CIM performs computation within the memory unit itself, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. Conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, so adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are used implicitly in previous work, clear and quantifiable algorithmic definitions of cross-layer scheduling for tiled CIM architectures are lacking. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare its performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization by up to 17.9×, resulting in an overall speedup of up to 29.2× compared to SOTA.