The effectiveness of Recurrent Neural Networks (RNNs) for tasks such as Automatic Speech Recognition has fostered interest in RNN inference acceleration. Because RNN computations are recurrent and carry data dependencies, prior work has designed customized architectures tailored to the RNN computation pattern, achieving high computation efficiency for certain chosen model sizes. However, since the dimensionality of RNNs varies widely across tasks, it is crucial to generalize this efficiency to diverse configurations. In this work, we identify adaptiveness as a key feature missing from today's RNN accelerators. In particular, we first show that state-of-the-art RNN implementations on GPU, FPGA, and ASIC architectures suffer from low resource utilization and low adaptiveness. To solve these issues, we propose an intelligent tile-based dispatching mechanism that increases the adaptiveness of RNN computation and handles the data dependencies efficiently. To this end, we propose Sharp, a hardware accelerator that pipelines RNN computation using an effective scheduling scheme to hide most of the dependent serialization. Furthermore, Sharp employs a dynamically reconfigurable architecture to adapt to the model's characteristics. Sharp achieves 2x, 2.8x, and 82x speedups on average, considering different RNN models and resource budgets, compared to the state-of-the-art ASIC, FPGA, and GPU implementations, respectively. In addition, Sharp provides a significant energy reduction with respect to previous solutions, owing to its low power dissipation (321 GFLOPS/Watt).