As the number of parameters in language models grows dramatically, sparsity methods have attracted increasing research attention as a way to compress and accelerate these models. While most research focuses on accurately retaining the appropriate weights so as to preserve the performance of the compressed model, sparse training itself incurs substantial computational overhead and memory footprint when compressing large-scale language models. To address this problem, we propose a Parameter-efficient Sparse Training (PST) method that reduces the number of trainable parameters during sparsity-aware training on downstream tasks. Specifically, we first combine data-free and data-driven criteria to measure the importance of weights efficiently and accurately. We then investigate the intrinsic redundancy of the data-driven weight importance and identify two salient characteristics, i.e., low-rankness and structuredness. Based on these, we introduce two groups of small matrices to compute the data-driven importance of weights in place of the original large importance score matrix, making sparse training both resource- and parameter-efficient. Experiments with diverse networks (i.e., BERT, RoBERTa, and GPT-2) on dozens of datasets demonstrate that PST performs on par with or better than previous sparsity methods despite training only a small number of parameters. For instance, compared with previous sparsity methods, PST requires only 1.5% of the trainable parameters to achieve comparable performance on BERT.
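To make the factorization concrete, the following is a minimal NumPy sketch of the core idea: each weight's importance is the sum of a data-free term (here, the weight magnitude) and a data-driven term that, rather than being stored as a full score matrix, is reconstructed from a low-rank product plus per-row and per-column scores. The names (combined_importance, sparsify, A, B, r, c, rank) and the specific row/column parameterization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_importance(W, A, B, r, c):
    """Importance = data-free magnitude + low-rank/structured data-driven part."""
    data_free = np.abs(W)                          # data-free criterion: |W|
    data_driven = A @ B + r[:, None] + c[None, :]  # low-rankness + structuredness
    return data_free + data_driven

def sparsify(W, scores, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the lowest importance."""
    k = int(sparsity * W.size)
    threshold = np.partition(scores.ravel(), k)[k]
    return W * (scores >= threshold)

rng = np.random.default_rng(0)
d_out, d_in, rank = 768, 768, 8
W = rng.standard_normal((d_out, d_in))

# Trainable score parameters: 2*d*rank + 2*d values instead of the
# d_out*d_in entries a full importance score matrix would require.
A = rng.standard_normal((d_out, rank)) * 0.01
B = rng.standard_normal((rank, d_in)) * 0.01
r = np.zeros(d_out)
c = np.zeros(d_in)

scores = combined_importance(W, A, B, r, c)
W_sparse = sparsify(W, scores, sparsity=0.9)
print(f"nonzero fraction: {np.count_nonzero(W_sparse) / W.size:.2f}")
```

Under these assumptions, a 768x768 layer needs only 2*768*8 + 2*768 = 13,824 trainable score values instead of 589,824, which is the source of the parameter savings the abstract reports; in actual training, A, B, r, and c would be updated by gradient descent on the downstream task.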