The increasing success and scaling of deep learning (DL) models demand higher computational efficiency and power. Sparsification can yield both smaller models and higher compute efficiency, and hardware that accelerates sparse computation is becoming available. However, exploiting this hardware efficiently requires kernel implementations, pruning algorithms, and storage formats that make use of its specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to a 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M, with little to no loss of accuracy in modern transformers.
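To make the N:M constraint mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It assumes the usual reading of N:M sparsity, in which at most N nonzero values are kept in every group of M consecutive weights, so that 2:4 corresponds to 50% sparsity. It uses simple magnitude pruning as a stand-in and is not the second-order pruning technique from the paper; it also does not model the V:N:M layout or Spatha. The names `prune_n_m` and `sparsity` are hypothetical helpers introduced for illustration only.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive
    entries and zero the rest (plain magnitude pruning, an assumption here)."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest-magnitude entries within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def sparsity(row):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in row if v == 0) / len(row)


# Hypothetical example weights, chosen only for illustration.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]

pruned_2_4 = prune_n_m(weights, n=2, m=4)  # 2 of every 4 kept: 50% sparsity
pruned_2_8 = prune_n_m(weights, n=2, m=8)  # 2 of every 8 kept: 75% sparsity

print(pruned_2_4, sparsity(pruned_2_4))
print(pruned_2_8, sparsity(pruned_2_8))
```

With M fixed at 4 and N at 2, the sparsity cannot exceed 50%, which is the limitation of the 2:4 format the abstract describes; a larger M with the same N, as in the 2:8 call, gives a higher ratio.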