Efficient processing of deep neural networks (DNNs) on embedded devices remains a significant challenge that limits their deployment. Exploiting sparsity in the network's feature maps is one way to reduce inference latency. Unstructured sparsity is known to cause less accuracy degradation than structured sparsity, but it requires extensive changes to the inference engine before it yields latency benefits. To tackle this challenge, we propose a solution that induces semi-structured activation sparsity exploitable through minor runtime modifications. To attain high speedups at inference time, we design a sparse training procedure that is aware of the final position of the activations when computing the General Matrix Multiplication (GEMM). We extensively evaluate the proposed solution across various models for image classification and object detection tasks. Remarkably, our approach yields a $1.25\times$ speedup with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset. Furthermore, when combined with a state-of-the-art structured pruning method, the resulting models provide a good latency-accuracy trade-off, outperforming models that employ structured pruning alone.
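To illustrate why activation sparsity that is aligned with the GEMM layout can be exploited with only minor runtime changes, the following is a minimal NumPy sketch, not the implementation evaluated here. It assumes that the semi-structured pattern zeroes whole columns of the im2col activation matrix (the abstract does not specify the exact pattern), and the function name `sparse_column_gemm`, the mask `keep`, and the matrix shapes are hypothetical choices made for the example.

```python
import numpy as np


def sparse_column_gemm(weights, cols, keep):
    """Compute weights @ cols while skipping masked activation columns.

    weights: (C_out, K) flattened filter matrix
    cols:    (K, N) im2col activation matrix, one column per output position
    keep:    (N,) boolean mask; False marks a column treated as all-zero
             (assumed sparsity pattern, for illustration only)
    """
    out = np.zeros((weights.shape[0], cols.shape[1]), dtype=weights.dtype)
    idx = np.flatnonzero(keep)            # positions of the columns to compute
    out[:, idx] = weights @ cols[:, idx]  # smaller dense GEMM on kept columns
    return out


rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 27))    # e.g. 8 filters of size 3x3x3
cols = rng.standard_normal((27, 16))      # 16 output positions
keep = rng.random(16) > 0.5               # hypothetical column mask

dense = weights @ (cols * keep)           # reference: dense GEMM on masked input
sparse = sparse_column_gemm(weights, cols, keep)
```

Under this assumption, the masked GEMM reduces to a smaller dense GEMM over the kept columns plus a scatter of the results, so the arithmetic saved is proportional to the fraction of skipped columns. Any actual latency gain depends on the runtime and hardware.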