Processing in-memory (PIM) is a promising approach for accelerating neural networks (NNs) because it minimizes data movement and provides large computational parallelism. As with other machine learning accelerators, application mapping, which determines operation scheduling and data layout, plays a critical role in NN acceleration on PIM. Mapping optimization for previous NN accelerators has focused on the latency of sequential execution. However, because PIM accelerators execute NN layers spatially across different memory locations, their mapping design space differs from that of conventional NN accelerators. This creates opportunities to reduce latency by overlapping the execution of consecutive NN layers, where the succeeding layer can start executing before the preceding layer has fully completed its computation. In this paper, we propose Fast-OverlaPIM, a framework that incorporates computational overlapping optimization into the deep neural network (DNN) mapping exploration process on PIM architectures. Fast-OverlaPIM includes analytical algorithms for fast and accurate overlap analysis. Furthermore, it introduces a novel mapping search strategy and a transformation mechanism that enable efficient design space exploration of overlap-based mappings for the whole network. Our framework improves runtime performance by 3.4x to 323.1x compared to the previous state-of-the-art overlap-based framework. Our experiments show that Fast-OverlaPIM can efficiently produce mappings that are 4.6x to 18.1x faster than those of the state-of-the-art mapping optimization framework under the same architecture constraints.