The inherent diversity of computation types within deep neural network (DNN) models often requires a variety of specialized units in hardware processors. This limits computational efficiency and increases both inference latency and power consumption, especially when a single hardware processor must support and execute different neural networks. In this study, we introduce NeuralMatrix, which elastically transforms the computations of entire DNNs into linear matrix operations. This transformation allows various DNN models to be executed seamlessly with matrix operations alone and paves the way for running versatile DNN models on a single General Matrix Multiplication (GEMM) accelerator. Extensive experiments with both CNN- and transformer-based models demonstrate the potential of NeuralMatrix to execute a wide range of DNN models accurately and efficiently, achieving 2.17x to 38.72x higher computational efficiency (i.e., throughput per unit power) compared to CPUs, GPUs, and SoC platforms. This level of efficiency is usually attainable only with an accelerator designed for a specific neural network.
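The core idea, expressing a whole network's computations with matrix-friendly operations, can be illustrated with a generic sketch. This is not the authors' exact method, only an assumption-laden example of one common technique: a nonlinear activation (here tanh) is approximated by precomputed piecewise-linear segments, so its evaluation reduces to a per-element multiply-add that maps naturally onto GEMM-style hardware.

```python
import numpy as np

def build_segments(f, lo=-4.0, hi=4.0, n=64):
    """Precompute slope/intercept tables for n linear segments of f over [lo, hi].
    (Offline step; the tables would be stored alongside the model weights.)"""
    xs = np.linspace(lo, hi, n + 1)
    ys = f(xs)
    a = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])  # per-segment slopes
    b = ys[:-1] - a * xs[:-1]                    # per-segment intercepts
    return xs, a, b

def pwl_apply(x, xs, a, b):
    """Evaluate the piecewise-linear approximation.
    Each element costs one table lookup plus one multiply-add (a*x + b),
    the kind of linear operation a GEMM accelerator handles natively."""
    x = np.clip(x, xs[0], xs[-1])
    idx = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(a) - 1)
    return a[idx] * x + b[idx]

xs, a, b = build_segments(np.tanh)
x = np.linspace(-3.0, 3.0, 1001)
err = np.max(np.abs(pwl_apply(x, xs, a, b) - np.tanh(x)))
print(f"max abs error with 64 segments: {err:.2e}")
```

With 64 segments the approximation error stays well below typical quantization noise, which suggests why replacing nonlinear units with linear segments can preserve accuracy while letting a single GEMM unit serve the entire network.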