Exponentially growing model sizes drive the continued success of deep learning, but they bring prohibitive computation and memory costs. From the algorithmic perspective, model sparsification and quantization have been studied to alleviate the problem. From the architectural perspective, hardware vendors provide Tensor cores for acceleration. However, gaining practical speedups from sparse, low-precision matrix operations on Tensor cores is very challenging because of the strict data-layout requirements and the lack of support for efficiently manipulating low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM (sparse matrix-matrix multiplication) and SDDMM (sampled dense-dense matrix multiplication), two major sparse operations in deep learning, with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average a 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and a 1.43x speedup over the state of the art with comparable accuracy for end-to-end sparse Transformer inference.
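For readers unfamiliar with the two operations named above, the following is a minimal pure-Python sketch of what SpMM and SDDMM compute. It is reference semantics only and does not reflect Magicube's kernels, data layout, or API. The function names `spmm_csr` and `sddmm_csr`, the choice of CSR storage, and the pattern-only form of SDDMM (the sparse operand contributes positions, not values) are assumptions made for illustration.

```python
# Reference semantics only: plain Python integers, no Tensor cores,
# no quantization. Names and CSR layout are illustrative assumptions.

def spmm_csr(indptr, indices, data, dense):
    """SpMM: C = A @ B, where A is sparse in CSR form and B is dense."""
    n_cols = len(dense[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0] * n_cols
        # Visit only the stored nonzeros of this row of A.
        for k in range(indptr[row], indptr[row + 1]):
            a = data[k]
            b_row = dense[indices[k]]
            for j in range(n_cols):
                acc[j] += a * b_row[j]
        out.append(acc)
    return out


def sddmm_csr(indptr, indices, lhs, rhs):
    """SDDMM: evaluate (lhs @ rhs) only at the positions of a CSR pattern.

    Returns the sampled values in CSR order (one per stored position).
    """
    inner = len(rhs)
    out = []
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            out.append(sum(lhs[row][t] * rhs[t][col] for t in range(inner)))
    return out


# Sparse pattern of a 2x3 matrix with nonzeros at (0,0), (0,2), (1,1).
indptr, indices = [0, 2, 3], [0, 2, 1]

# SpMM: A = [[1, 0, 2], [0, 3, 0]] times a dense 3x2 matrix.
c = spmm_csr(indptr, indices, [1, 2, 3], [[1, 2], [3, 4], [5, 6]])

# SDDMM: dense 2x2 times dense 2x3, sampled at the pattern above.
s = sddmm_csr(indptr, indices, [[1, 2], [3, 4]], [[1, 0, 2], [0, 1, 3]])
```

In both cases the work scales with the number of stored nonzeros rather than with the full matrix size, which is why these two operations are the natural targets for a sparse library.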