Deep learning and signal processing are closely coupled in many IoT scenarios, such as anomaly detection, where together they make connected devices intelligent. Many IoT processors rely on digital signal processors (DSPs) for signal processing and build their deep learning frameworks on top of them. However, deep learning is usually far more compute-intensive than signal processing, and its efficiency on DSPs is limited by the lack of native hardware support. We therefore take the opposite approach and propose to enable signal processing on top of a classical deep learning accelerator (DLA). Observing that irregular data patterns, such as the butterfly operations in the FFT, are the major barrier to deploying signal processing on DLAs, we insert a programmable data shuffling fabric between the input buffer and the computing array of the DLA, so that irregular data is reorganized and the processing becomes regular. With online data shuffling, the proposed architecture, SigDLA, adapts to various signal processing tasks without affecting deep learning processing. Moreover, we build a reconfigurable computing array to meet the differing data width requirements of signal processing and deep learning. In our experiments, SigDLA achieves average speedups of 4.4$\times$, 1.4$\times$, and 1.52$\times$, and average energy reductions of 4.82$\times$, 3.27$\times$, and 2.15$\times$, compared to an embedded ARM processor with customized DSP instructions, a DSP processor, and an independent DSP-DLA architecture, respectively, at the cost of 17% more chip area than the original DLA.
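To make the data shuffling idea concrete, the following is a minimal software sketch, not the SigDLA hardware design. It assumes a radix-2 decimation-in-time FFT and shows how each butterfly stage can be split into a gather step (the shuffle), a lane-wise multiply-add over contiguous vectors (the regular computation a computing array could execute), and a scatter step. The function names `bit_reverse_indices`, `stage_shuffle`, and `fft_via_shuffle` are hypothetical and chosen only for illustration.

```python
import cmath


def bit_reverse_indices(n):
    """Return the bit-reversal permutation for a power-of-two length n."""
    bits = n.bit_length() - 1
    order = []
    for i in range(n):
        r = 0
        for b in range(bits):
            if (i >> b) & 1:
                r |= 1 << (bits - 1 - b)
        order.append(r)
    return order


def stage_shuffle(n, half):
    """Index lists for one radix-2 stage (hypothetical shuffle program).

    Returns (top, bottom, tw): positions of the two butterfly operands
    and the twiddle exponent e, where the twiddle is exp(-2*pi*i*e/n).
    """
    top, bottom, tw = [], [], []
    for start in range(0, n, 2 * half):
        for k in range(half):
            top.append(start + k)
            bottom.append(start + k + half)
            tw.append(k * (n // (2 * half)))
    return top, bottom, tw


def fft_via_shuffle(x):
    """Radix-2 FFT written as shuffle + regular lane-wise arithmetic."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = [complex(x[i]) for i in bit_reverse_indices(n)]
    half = 1
    while half < n:
        top, bottom, tw = stage_shuffle(n, half)
        # Shuffle: gather the irregularly placed operands into dense vectors.
        a = [data[i] for i in top]
        b = [data[i] for i in bottom]
        w = [cmath.exp(-2j * cmath.pi * e / n) for e in tw]
        # Regular compute: the same multiply-add in every lane.
        prod = [wi * bi for wi, bi in zip(w, b)]
        upper = [ai + p for ai, p in zip(a, prod)]
        lower = [ai - p for ai, p in zip(a, prod)]
        # Scatter the results back to their butterfly positions.
        for j in range(len(top)):
            data[top[j]] = upper[j]
            data[bottom[j]] = lower[j]
        half *= 2
    return data
```

In this sketch the only stage-dependent part is the index lists produced by `stage_shuffle`; the arithmetic is identical in every lane and every stage, which is the property the abstract attributes to shuffling the data before it reaches the computing array.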