Dataflow scheduling decisions are of vital importance to neural network (NN) accelerators. Recent scalable NN accelerators support a rich set of advanced dataflow techniques. The problems of comprehensively representing and quickly finding optimized dataflow schemes thus become significantly more complicated and challenging. In this work, we first propose comprehensive and pragmatic dataflow representations for temporal and spatial scheduling on scalable multi-node NN architectures. An informal hierarchical taxonomy highlights the tight coupling across different levels of the dataflow space as the major difficulty for fast design exploration. A set of formal tensor-centric directives accurately expresses various inter-layer and intra-layer schemes, and allows their validity and efficiency to be determined quickly. We then build a generic, optimized, and fast dataflow solver, KAPLA, which uses these pragmatic directives to explore the design space with effective validity checks and efficiency estimation. KAPLA decouples the upper inter-layer level for fast pruning, and solves the lower intra-layer schemes with a novel bottom-up cost descending method. The dataflows KAPLA finds incur only 2.2% and 7.7% energy overhead for training and inference, respectively, compared to the exhaustively searched optimal schemes. KAPLA also outperforms random and machine-learning-based approaches, producing better-optimized results with a search that is orders of magnitude faster.