Various processing-in-memory (PIM) accelerators, built on diverse devices, micro-architectures, and interfaces, have been proposed to accelerate deep neural networks (DNNs). How DNNs are deployed onto PIM-based accelerators is key to exploiting PIM's high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment are far beyond what manual deployment can handle, so an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators that achieves efficient deployment of DNN models on PIM hardware. PIMCOMP adapts to various PIM architectures through an abstract, configurable PIM accelerator template with a set of pseudo-instructions, which serve as a high-level abstraction of the hardware's fundamental functionalities. Through a generic multi-level optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo-instructions, which can be further lowered to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: resource utilization and dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, applies a weight-layout-guided computation-storage mapping to enhance resource utilization, and balances the system's computation, memory-access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different inter-layer pipeline granularities to support varying application scenarios while ensuring high computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at \url{https://github.com/sunxt99/PIMCOMP-NN}.