Transformer-based large language models (e.g., BERT and GPT) have achieved great success, and fine-tuning, which adapts a pre-trained model to a task-specific dataset, is the standard way to apply these models to downstream tasks. However, fine-tuning Transformers suffers from long running times and high memory consumption because the models are large. We propose SPT, a system that fine-tunes Transformer-based models efficiently by introducing sparsity. We observe that the memory consumption of a Transformer mainly comes from storing the attention weights of multi-head attention (MHA), and that the majority of the running time is spent in the feed-forward network (FFN). We therefore design a sparse MHA module, which computes and stores only the large attention weights to reduce memory consumption, and a routed FFN module, which dynamically activates a subset of the model parameters for each token to reduce computation cost. We implement SPT on PyTorch and write custom CUDA kernels to run sparse MHA and routed FFN efficiently. Specifically, for sparse MHA, we use product quantization to identify the large attention weights and compute attention via sparse matrix multiplication. For routed FFN, we batch tokens according to their activated model parameters for efficient computation. We conduct extensive experiments to evaluate SPT on various model configurations. The results show that SPT consistently outperforms well-optimized baselines, reducing peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
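To make the sparse MHA idea concrete, the following is a minimal single-head sketch in plain Python, not SPT's implementation. It assumes that "large attention weights" means the k largest scores per query, and it finds them with exact dot products as a stand-in for the product-quantization step described above. The function name `topk_sparse_attention` and its parameters are hypothetical.

```python
import math

def topk_sparse_attention(queries, keys, values, k):
    """Single-head attention that keeps only the k largest scores per query.

    Illustrative only: scores are computed exactly here, whereas a real
    system would use an approximate method to avoid scoring every key.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, kept = [], []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k largest scores for this query.
        top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax restricted to the kept entries (max subtracted for stability).
        m = max(scores[j] for j in top)
        exps = {j: math.exp(scores[j] - m) for j in top}
        z = sum(exps.values())
        weights = {j: e / z for j, e in exps.items()}
        # Weighted sum over the kept values only.
        outputs.append([sum(weights[j] * values[j][d] for j in top) for d in range(dim)])
        kept.append(weights)
    return outputs, kept
```

Only k weights are stored per query instead of one per key, which is the source of the memory saving in this sketch. Setting k to the number of keys recovers ordinary dense attention.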
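The routed FFN idea, in which each token activates a subset of parameters and tokens are batched by the parameters they activate, can be sketched as follows. This is a hedged illustration under stated assumptions rather than SPT's kernel: it assumes the FFN hidden dimension is split into blocks, that a linear router picks exactly one block per token, and that the activation is ReLU. The names `routed_ffn`, `router`, `w1_blocks`, and `w2_blocks` are hypothetical.

```python
from collections import defaultdict

def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m), giving a list of length m."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def routed_ffn(tokens, router, w1_blocks, w2_blocks):
    """Apply a block-routed FFN; returns (outputs, token indices grouped by block)."""
    # Route: each token picks the block with the highest router score.
    groups = defaultdict(list)
    for idx, x in enumerate(tokens):
        scores = matvec(x, router)
        groups[scores.index(max(scores))].append(idx)
    # Compute: process the tokens of each block together, touching only
    # that block's slice of the FFN weights.
    outputs = [None] * len(tokens)
    for b, idxs in groups.items():
        for idx in idxs:
            hidden = [max(0.0, h) for h in matvec(tokens[idx], w1_blocks[b])]  # ReLU
            outputs[idx] = matvec(hidden, w2_blocks[b])
    return outputs, dict(groups)
```

In this sketch each token multiplies against one block instead of the full weight matrices, so the per-token cost shrinks roughly in proportion to the number of blocks. Grouping tokens by block first is what would let an optimized implementation run one dense matrix multiplication per group.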