Reducing the computational cost of running large-scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general-purpose accelerators (GPAs) such as NVIDIA GPUs using low-precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs and any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication in both of these modes on IPU with a range of block sizes, matrix sizes and densities. Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU across a range of sparsity levels when the matrix size and block size are large. Furthermore, static sparsity in general outperforms dynamic sparsity. While previous work on GPAs has shown speedups only for very high sparsity (typically 99% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90%). IPU code is available to view and run at ipu.dev/sparsity-benchmarks; GPU code will be made available shortly.
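To make the terms "block structure" and "density" concrete, the following is a minimal NumPy sketch of a block-sparse matrix multiplied by a dense matrix. It is not the PopSparse API and says nothing about how the library maps work onto IPU hardware; the function names (`random_block_sparse`, `block_sparse_matmul`, `to_dense`) and the storage layout are assumptions chosen for illustration, and float32 is used instead of FP16 to keep the comparison with a dense reference simple.

```python
import numpy as np


def random_block_sparse(m, k, block_size, density, rng):
    """Pick a random block pattern for an (m x k) matrix (hypothetical helper).

    Returns the non-zero blocks plus their block-row and block-column indices.
    `density` is the fraction of blocks that are non-zero, so density 0.1
    corresponds to 90% sparsity.
    """
    b = block_size
    grid_rows, grid_cols = m // b, k // b
    n_blocks = max(1, round(density * grid_rows * grid_cols))
    # Sample distinct block positions from the flattened block grid.
    chosen = rng.choice(grid_rows * grid_cols, size=n_blocks, replace=False)
    block_rows, block_cols = np.divmod(chosen, grid_cols)
    blocks = rng.standard_normal((n_blocks, b, b)).astype(np.float32)
    return blocks, block_rows, block_cols


def block_sparse_matmul(blocks, block_rows, block_cols, m, block_size, dense_rhs):
    """Compute sparse @ dense_rhs while touching only the non-zero blocks."""
    b = block_size
    out = np.zeros((m, dense_rhs.shape[1]), dtype=dense_rhs.dtype)
    for blk, r, c in zip(blocks, block_rows, block_cols):
        # Each non-zero block contributes a small dense matmul to one row band.
        out[r * b:(r + 1) * b, :] += blk @ dense_rhs[c * b:(c + 1) * b, :]
    return out


def to_dense(blocks, block_rows, block_cols, m, k, block_size):
    """Expand the block representation into a full dense matrix (for checking)."""
    b = block_size
    dense = np.zeros((m, k), dtype=blocks.dtype)
    for blk, r, c in zip(blocks, block_rows, block_cols):
        dense[r * b:(r + 1) * b, c * b:(c + 1) * b] = blk
    return dense
```

In this sketch a static pattern would mean `block_rows` and `block_cols` are known ahead of time and only `blocks` changes between runs, whereas a dynamic pattern would allow all three to change on every call.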