An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To close this gap, model compression techniques such as structured pruning are being used to improve inference efficiency. Yet most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models whose weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including BERT-Mini, DistilBERT, BERT-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
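As a rough illustration of what pruning with a constant block size means for SpMM, the sketch below zeroes whole blocks of a weight matrix and then multiplies using only the surviving blocks. It is not the kernel described in this paper: it is plain Python with no Intel Deep Learning Boost instructions, and the function names, the L1-norm pruning criterion, and the 4x1 block shape used in the demo are assumptions chosen for illustration.

```python
def block_prune(weight, bh, bw, sparsity):
    """Zero the bh x bw blocks with the smallest L1 norm.

    `sparsity` is the fraction of blocks to zero (an illustrative criterion,
    not necessarily the one used to prune the models in the paper).
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % bh == 0 and cols % bw == 0, "shape must be a multiple of the block"
    norms = []
    for bi in range(rows // bh):
        for bj in range(cols // bw):
            norm = sum(abs(weight[bi * bh + r][bj * bw + c])
                       for r in range(bh) for c in range(bw))
            norms.append((norm, bi, bj))
    norms.sort()
    pruned = [row[:] for row in weight]
    for _, bi, bj in norms[:int(len(norms) * sparsity)]:
        for r in range(bh):
            for c in range(bw):
                pruned[bi * bh + r][bj * bw + c] = 0.0
    return pruned


def to_block_sparse(weight, bh, bw):
    """Store only the non-zero blocks as (block_row, block_col, values)."""
    blocks = []
    for bi in range(len(weight) // bh):
        for bj in range(len(weight[0]) // bw):
            vals = [[weight[bi * bh + r][bj * bw + c] for c in range(bw)]
                    for r in range(bh)]
            if any(v != 0 for row in vals for v in row):
                blocks.append((bi, bj, vals))
    return blocks


def spmm(blocks, n_rows, bh, bw, dense):
    """Compute (block-sparse weight) @ dense, visiting stored blocks only."""
    n_out = len(dense[0])
    out = [[0.0] * n_out for _ in range(n_rows)]
    for bi, bj, vals in blocks:
        for r in range(bh):
            out_row = out[bi * bh + r]
            for c in range(bw):
                w = vals[r][c]
                dense_row = dense[bj * bw + c]
                for j in range(n_out):
                    out_row[j] += w * dense_row[j]
    return out


def dense_matmul(a, b):
    """Reference dense product used to check the sparse result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    weight = [[1.0, -2.0, 0.5, 3.0],
              [2.0, 1.0, -0.5, -4.0],
              [-1.0, 0.0, 1.0, 5.0],
              [0.5, 1.0, 1.0, -6.0]]
    activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    pruned = block_prune(weight, 4, 1, 0.75)   # 4x1 blocks, 75% of blocks zeroed
    blocks = to_block_sparse(pruned, 4, 1)
    print(spmm(blocks, 4, 4, 1, activations))
```

Because every stored block has the same shape, the inner loops have fixed trip counts, which is the property a vectorized kernel can exploit; at 75% block sparsity the sketch touches only a quarter of the weight entries that the dense product would.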
