SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification

Transformer-based large language models (e.g., BERT and GPT) have achieved great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice for applying these models to downstream tasks. However, Transformer fine-tuning suffers from long running times and high memory consumption because of the large size of the models. We propose the SPT system, which fine-tunes Transformer-based models efficiently by introducing sparsity. We observe that the memory consumption of a Transformer comes mainly from storing the attention weights for multi-head attention (MHA), and that the majority of the running time is spent on the feed-forward network (FFN). We therefore design two modules: the sparse MHA module, which computes and stores only the large attention weights to reduce memory consumption, and the routed FFN module, which dynamically activates a subset of the model parameters for each token to reduce computation cost. We implement SPT on PyTorch and customize CUDA kernels to run sparse MHA and routed FFN efficiently. Specifically, for sparse MHA, we use product quantization to identify the large attention weights and compute attention via sparse matrix multiplication. For routed FFN, we batch the tokens according to their activated model parameters for efficient computation. We conduct extensive experiments to evaluate SPT on various model configurations. The results show that SPT consistently outperforms well-optimized baselines, reducing peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
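The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the SPT implementation: the function names, tensor layouts, the ReLU activation, and the argmax router are assumptions made for illustration. In particular, the sketch finds the large attention weights by computing all scores densely and taking the top-k per query, whereas the abstract states that SPT uses product quantization and sparse matrix multiplication in custom CUDA kernels so that the dense scores need not be materialized.

```python
import numpy as np


def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention keeping only the top_k largest scores per query.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v). Illustrative only: the dense
    score matrix is built here, which a real sparse kernel would avoid.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_k)
    # Column indices of the top_k largest scores in each row (unordered).
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    kept = np.take_along_axis(scores, idx, axis=-1)    # (n_q, top_k)
    # Softmax restricted to the kept entries; the rest are treated as zero.
    kept = np.exp(kept - kept.max(axis=-1, keepdims=True))
    weights = kept / kept.sum(axis=-1, keepdims=True)
    # Weighted sum over the selected value rows only: v[idx] is (n_q, top_k, d_v).
    return np.einsum("qk,qkd->qd", weights, v[idx])


def routed_ffn(x, w_router, w1, w2):
    """FFN whose hidden units are split into blocks; each token uses one block.

    x: (n_tokens, d_model), w_router: (d_model, n_blocks),
    w1: (n_blocks, d_model, d_block), w2: (n_blocks, d_block, d_model).
    The argmax router is a hypothetical stand-in for the paper's routing.
    """
    block_of_token = np.argmax(x @ w_router, axis=-1)  # (n_tokens,)
    out = np.zeros_like(x)
    # Batch tokens that activate the same block so each block runs one matmul.
    for b in np.unique(block_of_token):
        rows = np.nonzero(block_of_token == b)[0]
        hidden = np.maximum(x[rows] @ w1[b], 0.0)      # ReLU (assumed)
        out[rows] = hidden @ w2[b]
    return out
```

With `top_k` equal to the number of keys, `topk_sparse_attention` reduces to ordinary softmax attention, while a smaller `top_k` stores only `top_k` weights per query instead of one per key. In `routed_ffn`, each token touches `d_block` hidden units rather than the full hidden width, and grouping tokens by block is what lets the per-block work run as a single batched matrix multiplication.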

