Inference of ML models with diverse structures, types, and sizes boils down to executing different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude compared with a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows requires on-chip data layout reordering and datapath reconfiguration, and the resulting non-trivial overhead prevents ML accelerators from exploiting different dataflows, leaving performance suboptimal. To address this challenge, we propose FEATHER, an accelerator that combines a novel spatial array termed Nest with a novel multi-stage reduction network called BIRRD. BIRRD performs flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we extend Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term the result Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on a ZCU104 edge FPGA. In Layoutloop, FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement over state-of-the-art accelerators such as NVDLA, SIGMA, and Eyeriss on ResNet-50 and MobileNet-V3. On practical FPGA devices, FEATHER achieves 2.65x and 3.91x higher throughput than Xilinx DPU and Gemmini, respectively. Remarkably, these performance and energy efficiency gains come at only 6% area overhead over a fixed-dataflow Eyeriss-like accelerator. Our code is released at https://github.com/maeri-project/FEATHER.
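To make the term "dataflow" concrete, the following minimal sketch runs the same matrix multiplication under different loop orders and tile sizes. It is an illustration only and does not model FEATHER, Nest, BIRRD, or Layoutloop; the function name `matmul_dataflow` and its `order` and `tile` parameters are hypothetical, and parallelism is not modeled.

```python
from itertools import product


def matmul_dataflow(A, B, order=("m", "n", "k"), tile=None):
    """Multiply A (M x K) by B (K x N) using a chosen loop order and tiling.

    `order` is a permutation of ("m", "n", "k"), outermost loop first.
    `tile` maps a dimension name to its tile size; untiled dimensions
    use their full extent. Both names are assumptions for illustration.
    """
    M, K, N = len(A), len(B), len(B[0])
    bounds = {"m": M, "n": N, "k": K}
    tile = tile or {}
    sizes = {d: tile.get(d, bounds[d]) for d in order}
    C = [[0] * N for _ in range(M)]

    # Outer loops step from tile to tile in the chosen order.
    outer = [range(0, bounds[d], sizes[d]) for d in order]
    for base in product(*outer):
        # Inner loops visit every point inside the current tile.
        inner = [
            range(b, min(b + sizes[d], bounds[d]))
            for b, d in zip(base, order)
        ]
        for point in product(*inner):
            i = dict(zip(order, point))
            C[i["m"]][i["n"]] += A[i["m"]][i["k"]] * B[i["k"]][i["n"]]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

# Two different dataflows for the same computation.
c_default = matmul_dataflow(A, B)
c_tiled = matmul_dataflow(A, B, order=("k", "n", "m"), tile={"k": 2, "m": 1})
```

Both calls produce the same output matrix; what changes is the order in which operands are touched, which is why the preferred on-chip data layout can differ from one dataflow to another.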