The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM performs computation within the memory unit itself, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. Conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, so adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it increases the utilization of CIM cores and thereby accelerates computation. Although similar concepts are used implicitly in previous work, clear and quantifiable algorithmic definitions of cross-layer scheduling for tiled CIM architectures are lacking. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare its performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization by up to 17.9×, resulting in an overall speedup of up to 29.2× compared to SOTA.
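To illustrate why cross-layer scheduling can raise core utilization, the following toy model may help. It is only a sketch under strong simplifying assumptions and is not the CLSA-CIM algorithm: each layer is assumed to sit on its own CIM core, each layer's output is split into equally sized chunks, every chunk takes one time unit, and chunk `c` of a layer is assumed to depend only on chunk `c` of the previous layer. All function names and parameters are hypothetical.

```python
def layer_by_layer_makespan(num_layers, chunks, t=1):
    """Baseline: a layer starts only after the previous layer has finished all chunks."""
    return num_layers * chunks * t


def cross_layer_makespan(num_layers, chunks, t=1):
    """Toy cross-layer schedule: a layer starts chunk c as soon as the previous
    layer has produced chunk c and its own core is free (assumed dependency)."""
    finish = [[0] * chunks for _ in range(num_layers)]
    for layer in range(num_layers):
        for c in range(chunks):
            input_ready = finish[layer - 1][c] if layer > 0 else 0
            core_free = finish[layer][c - 1] if c > 0 else 0
            finish[layer][c] = max(input_ready, core_free) + t
    return finish[-1][-1]


def utilization(num_layers, chunks, makespan, t=1):
    """Fraction of core-time spent computing, assuming one core per layer."""
    busy_time = num_layers * chunks * t
    return busy_time / (num_layers * makespan)


if __name__ == "__main__":
    layers, chunks = 4, 8
    base = layer_by_layer_makespan(layers, chunks)
    cross = cross_layer_makespan(layers, chunks)
    print("layer-by-layer:", base, utilization(layers, chunks, base))
    print("cross-layer:   ", cross, utilization(layers, chunks, cross))
```

In this toy setting, overlapping consecutive layers shortens the schedule because cores no longer idle while waiting for an entire preceding layer to finish. Real networks have more complex data dependencies between layers, so the numbers here say nothing about the results reported above.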