Because recent Deep Neural Network (DNN) models are increasingly memory-bound, inter-operator pipelining for DNN accelerators is emerging as a promising optimization. Inter-operator pipelining reduces costly on-chip global memory and off-chip memory accesses by forwarding the output of one layer as the input of the next layer within the compute array, which previous works have shown to be an effective optimization. However, the design space of inter-operator pipelining is huge and has not yet been fully explored. In particular, identifying the right depth and granularity of pipelining (or whether to pipeline at all) depends significantly on the layer shapes and on the data volumes of weights and activations, and these differ even within a single domain. Moreover, prior works divide the substrate into large chunks and map one layer onto each chunk, which requires communication halfway across the substrate or through the global buffer. For fine-grained inter-operator pipelining, however, placing the consumer tile of the next layer close to the corresponding producer tile of the current layer is a better way to exploit fine-grained spatial reuse. To support a variable number of layers (i.e., the right depth) and multiple spatial organizations of layers (in accordance with the pipelining granularity) on the substrate, we propose PipeOrgan, a new class of spatial data organization strategy for energy-efficient and congestion-free communication between the processing elements (PEs) across various pipeline depths and granularities. PipeOrgan takes advantage of flexible spatial organization and can allocate layers to PEs based on the granularity of pipelining. We also propose changes to the conventional mesh topology to improve the performance of coarse-grained allocation. PipeOrgan achieves a 1.95x performance improvement over the state-of-the-art pipelined dataflow on XR-bench workloads.
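The placement argument in the abstract can be illustrated with a toy hop-count model. The sketch below is not PipeOrgan's mapping algorithm; it assumes a hypothetical n x n mesh of PEs (n even), exactly two pipelined layers, and a one-to-one pairing between each producer PE and a single consumer PE. It compares a coarse layout (layer 1 on the left half, layer 2 on the right half) with an interleaved layout (producer and consumer in adjacent columns) by average Manhattan distance.

```python
# Toy model (assumption): each producer PE forwards its output tile to exactly
# one consumer PE, and communication cost is the Manhattan hop count on a mesh.

def manhattan(a, b):
    """Hop count between two PEs given as (row, col) coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def coarse_pairs(n):
    """Layer 1 occupies the left half of the mesh, layer 2 the right half.
    The producer at (r, c) feeds the consumer at (r, c + n // 2)."""
    half = n // 2
    return [((r, c), (r, c + half)) for r in range(n) for c in range(half)]

def fine_pairs(n):
    """Layers are interleaved column by column.
    The producer at (r, c), c even, feeds the neighbouring consumer at (r, c + 1)."""
    return [((r, c), (r, c + 1)) for r in range(n) for c in range(0, n, 2)]

def mean_hops(pairs):
    """Average producer-to-consumer distance over all pairs."""
    return sum(manhattan(p, q) for p, q in pairs) / len(pairs)

if __name__ == "__main__":
    n = 8  # hypothetical 8 x 8 PE array
    print("coarse layout, mean hops:", mean_hops(coarse_pairs(n)))
    print("interleaved layout, mean hops:", mean_hops(fine_pairs(n)))
```

Under these assumptions the coarse layout costs n/2 hops per forwarded tile while the interleaved layout costs one hop, which is the intuition behind placing consumers next to their producers. Real layers have many-to-many dependencies and unequal sizes, so this is only a lower-bound style illustration.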