Although code generation for Convolutional Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly constrained Multicore Neural Processor Units (NPUs) is still a challenging problem. Given the size of convolutions' input/output tensors and the small footprint of NPU on-chip memories, minimizing memory transactions while maximizing parallelism and MAC utilization is central to any effective solution. This paper proposes a TensorFlow XLA/LLVM compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO), which: (a) maximizes convolution parallelism and memory usage across NPU cores; and (b) reduces data transfers between host and NPU on-chip memories by using DRAM memory burst time estimates to guide tensor slicing. To evaluate the proposed approach, a set of experiments was performed using the NeuroMorphic Processor (NMP), a multicore NPU containing 32 RISC-V cores extended with novel CNN instructions. Experimental results show that TSO is capable of identifying the tensor slicing that minimizes execution time for a set of CNN models. Speed-ups of up to 21.7\% are obtained when comparing the TSO burst-based technique to a no-burst data slicing approach. To validate the generality of the TSO approach, the algorithm was also ported to the Glow Machine Learning framework. The performance of the models was measured with both the Glow and TensorFlow XLA/LLVM compilers, revealing similar results.
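The idea of letting DRAM burst time estimates guide tensor slicing can be illustrated with a toy cost model. The sketch below is not the TSO algorithm itself: the function names, the 64-byte burst size, the setup and burst costs, the NHWC layout, the stride-1 padded convolution, and the one-byte elements are all assumptions made for illustration. It enumerates ways of splitting a convolution's output rows and output channels, discards slicings that do not fit in a given on-chip memory budget, and picks the one with the lowest estimated transfer time.

```python
from math import ceil


def burst_time(run_bytes, runs, burst_bytes=64, t_burst=1.0, t_setup=4.0):
    """Estimated time to move `runs` contiguous runs of `run_bytes` each.

    Hypothetical model: every run pays a fixed setup cost plus one time
    unit per (possibly partial) burst.
    """
    return runs * (t_setup + ceil(run_bytes / burst_bytes) * t_burst)


def slicing_cost(h, w, c_in, c_out, k, sh, sc, mem_bytes):
    """Estimated total transfer time when the output height is split into
    `sh` parts and the output channels into `sc` parts.

    Returns None if one slice does not fit in `mem_bytes` of on-chip memory.
    Assumes stride 1, a padded input of (h + k - 1) x (w + k - 1), NHWC
    layout, and one byte per element.
    """
    oh, oc = ceil(h / sh), ceil(c_out / sc)   # output rows/channels per slice
    ih, iw = oh + k - 1, w + k - 1            # input rows/cols incl. halo
    in_bytes = ih * iw * c_in
    wt_bytes = k * k * c_in * oc
    out_bytes = oh * w * oc
    if in_bytes + wt_bytes + out_bytes > mem_bytes:
        return None
    t_in = burst_time(iw * c_in, ih)          # one contiguous run per input row
    t_wt = burst_time(wt_bytes, 1)            # weights as a single run
    if sc == 1:
        t_out = burst_time(w * oc, oh)        # full output rows are contiguous
    else:
        t_out = burst_time(oc, oh * w)        # channel slices: one short run per pixel
    return sh * sc * (t_in + t_wt + t_out)


def best_slicing(h, w, c_in, c_out, k, mem_bytes):
    """Return (cost, sh, sc) with the lowest estimated cost, or None."""
    candidates = []
    for sh in range(1, h + 1):
        for sc in range(1, c_out + 1):
            cost = slicing_cost(h, w, c_in, c_out, k, sh, sc, mem_bytes)
            if cost is not None:
                candidates.append((cost, sh, sc))
    return min(candidates) if candidates else None


if __name__ == "__main__":
    # A 32x32x16 input, 32 output channels, 3x3 kernel, 16 KiB of on-chip memory.
    print(best_slicing(32, 32, 16, 32, 3, 16 * 1024))
```

In this model, slicing along output channels breaks the output into many short runs, each paying the per-run setup cost, which a pure byte-counting (no-burst) estimate would not capture. A real compiler pass would additionally need to account for the number of cores and for compute time.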