In this paper, we present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices). The FADES design offers multiple configuration options that trade off parallelism against complexity, using a dataflow model with four stages that read, compute, scale, and write results. FADES is mapped to the programmable logic (PL) and integrated with the TensorFlow Lite inference engine running on the processing system (PS) of a heterogeneous SoC device. The accelerator computes the tensor operations, and dynamic reconfiguration switches its precision between int8 and float modes. Compared with supporting both arithmetic precisions simultaneously, dynamic reconfiguration improves performance by allowing more cores to be mapped to the resource-constrained device, and it lowers power consumption. Compared with a high-performance systolic architecture for dense matrices, the proposed hardware achieves 25% better performance in dense mode with half the DSP blocks in the same technology. In sparse mode, we show that the core can outperform dense mode even at low sparsity levels, and that a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
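To make the four-stage dataflow concrete, the following is a minimal software sketch of a read, compute, scale, and write pipeline for an int8 matrix product with an optional sparse mode. It is an illustration only and not the FADES hardware design: all function names are hypothetical, and the requantization step (a single floating-point scale, a zero point, and clamping to the int8 range) is an assumption modeled on common int8 quantization practice rather than a detail taken from this paper.

```python
def read_stage(weights, sparse):
    """Read: yield each weight row as (column, value) pairs.

    In sparse mode, zero weights are dropped here so that the compute
    stage never spends a multiply-accumulate on them (assumed behavior).
    """
    for row in weights:
        yield [(k, w) for k, w in enumerate(row) if w != 0 or not sparse]


def compute_stage(rows, activations):
    """Compute: integer multiply-accumulate into wide accumulators."""
    n_cols = len(activations[0])
    for pairs in rows:
        acc = [0] * n_cols
        for k, w in pairs:
            for j in range(n_cols):
                acc[j] += w * activations[k][j]
        yield acc


def scale_stage(accs, scale, zero_point=0):
    """Scale: requantize accumulators back to the int8 range (assumed scheme)."""
    for acc in accs:
        yield [max(-128, min(127, round(a * scale) + zero_point)) for a in acc]


def write_stage(rows):
    """Write: collect the output rows, standing in for a store to memory."""
    return list(rows)


def matmul_int8(weights, activations, scale, sparse=False):
    """Chain the four stages; generators let each row stream through in turn."""
    rows = read_stage(weights, sparse)
    accs = compute_stage(rows, activations)
    scaled = scale_stage(accs, scale)
    return write_stage(scaled)


def mac_count(weights, sparse):
    """Count weight entries that reach the compute stage."""
    return sum(len(pairs) for pairs in read_stage(weights, sparse))
```

In this sketch, dense and sparse modes produce the same result, and the only difference is how many weight entries reach the compute stage, which `mac_count` reports. For example, `matmul_int8(w, a, 1.0, sparse=True)` and `matmul_int8(w, a, 1.0)` can be compared on any small integer matrices `w` and `a` of compatible shapes.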