Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models usually leave the non-zero weights randomly distributed in order to maintain accuracy, which leads to irregular computations. As a result, unstructured sparse models cannot achieve meaningful speedup on commodity hardware built for dense matrix computations, and accelerators are typically modified or redesigned with structured-sparsity-optimized architectures to exploit the sparsity. For example, the NVIDIA Ampere architecture introduces the sparse tensor core, which adopts the 2:4 sparsity pattern. We propose a pruning method built on the insight that matrix multiplication generally partitions a large matrix into multiple smaller tiles for parallel execution. We present the tile-wise sparsity pattern, which maintains a structured sparsity pattern at the tile level for efficient execution but allows irregular pruning at the global scale to preserve high accuracy. Moreover, tile-wise sparsity is enforced at the global memory level, whereas 2:4 sparsity is exploited at the register level inside the sparse tensor core; we therefore combine the two into a tile-vector-wise (TVW) sparsity pattern to exploit finer-grained sparsity and further accelerate sparse DNN models. We evaluate TVW on GPUs and achieve average speedups of $1.85\times$, $2.75\times$, and $22.18\times$ over the dense model, block sparsity, and unstructured sparsity, respectively.
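To make the two granularities concrete, the following is a minimal NumPy sketch of the idea rather than the paper's actual pruning algorithm: the tile size, the column-keep ratio, and the magnitude-based heuristics are illustrative assumptions. Within each tile it first zeroes the weakest columns (the tile-level structure) and then enforces a 2:4 pattern on the surviving weights (the vector-level structure that the sparse tensor core consumes).

```python
import numpy as np

def prune_2to4(block):
    """Keep the 2 largest-magnitude values in every group of 4 along the last axis."""
    out = block.copy()
    usable = out.shape[-1] - out.shape[-1] % 4  # leave any trailing remainder untouched
    for start in range(0, usable, 4):
        group = out[..., start:start + 4]                      # view into `out`
        idx = np.argsort(np.abs(group), axis=-1)[..., :2]      # 2 smallest per group
        np.put_along_axis(group, idx, 0.0, axis=-1)            # zero them in place
    return out

def tile_vector_wise_prune(W, tile=(32, 32), col_keep_ratio=0.5):
    """Illustrative tile-vector-wise (TVW) pruning sketch (assumed heuristic, not the paper's method).

    1. Tile-wise: within each tile, zero the lowest-magnitude columns so the
       tile keeps a regular structure that a tiled GEMM kernel can exploit.
    2. Vector-wise: enforce the 2:4 pattern on the remaining weights so a
       sparse tensor core can skip half of the multiplications.
    """
    W = W.copy()
    rows, cols = W.shape
    th, tw = tile
    for r in range(0, rows, th):
        for c in range(0, cols, tw):
            blk = W[r:r + th, c:c + tw]                        # view into `W`
            n_drop = int(blk.shape[1] * (1 - col_keep_ratio))
            weakest = np.argsort(np.abs(blk).sum(axis=0))[:n_drop]
            blk[:, weakest] = 0.0                              # tile-wise column pruning
            W[r:r + th, c:c + tw] = prune_2to4(blk)            # vector-wise 2:4 pruning
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 128)).astype(np.float32)
    Ws = tile_vector_wise_prune(W)
    print("overall sparsity:", float((Ws == 0).mean()))
```

In this sketch the two steps are applied independently for clarity; an actual implementation would also repack the surviving columns and metadata into the layout expected by the GPU kernel, which is beyond the scope of this example.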