Efficient processing of deep neural networks (DNNs) on embedded devices remains a significant challenge that limits their deployment. Exploiting sparsity in a network's feature maps is one way to reduce inference latency. Unstructured sparsity is known to degrade accuracy less than structured sparsity, but it requires extensive changes to the inference engine before it yields any latency benefit. To tackle this challenge, we propose a solution that induces semi-structured activation sparsity, which can be exploited through minor runtime modifications. To attain high speedups at inference time, we design a sparse training procedure that is aware of the final position of the activations when the General Matrix Multiplication (GEMM) is computed. We extensively evaluate the proposed solution across various models for image classification and object detection tasks. Remarkably, our approach yields a $1.25 \times$ speedup with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset. Furthermore, when combined with a state-of-the-art structured pruning method, the resulting models provide a good latency-accuracy trade-off, outperforming models that employ structured pruning alone.
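As a rough illustration of why sparsity with a known position in the GEMM operands needs only small runtime changes, the NumPy sketch below skips zeroed activation columns by compacting the operand before the multiplication. It is not the procedure proposed here: the column-wise mask layout, the matrix shapes, and the helper name `gemm_skip_zero_columns` are all assumptions chosen for the example.

```python
import numpy as np

def gemm_skip_zero_columns(weights, activations, keep):
    """Hypothetical helper: multiply only the activation columns marked in `keep`.

    weights:     (C_out, K) matrix
    activations: (K, N) matrix whose columns outside `keep` are all zero
    keep:        boolean mask of length N (assumed known ahead of time)
    """
    out = np.zeros((weights.shape[0], activations.shape[1]),
                   dtype=np.result_type(weights, activations))
    # Dense GEMM on the compacted operand; skipped columns stay zero.
    out[:, keep] = weights @ activations[:, keep]
    return out

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 27))       # assumed sizes, for illustration only
activations = rng.standard_normal((27, 16))

keep = np.ones(16, dtype=bool)
keep[::4] = False                            # assumed mask: drop every fourth column
activations[:, ~keep] = 0.0

dense = weights @ activations
sparse = gemm_skip_zero_columns(weights, activations, keep)
```

Because the zeros occupy whole columns, the compacted product is still an ordinary dense GEMM on a smaller matrix, whereas scattered (unstructured) zeros would call for a dedicated sparse kernel.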