Large-scale Transformer models (TM) have demonstrated outstanding performance across various tasks. However, their considerable parameter size restricts their applicability, particularly on mobile devices. Because gradients on TM are more dynamic and intricate than those on Convolutional Neural Networks, commonly used pruning methods tend to retain weights with larger gradient noise. The resulting pruned models are sensitive to sparsity and to the dataset, and they perform suboptimally. Symbolic Descent (SD) is a general approach for training and fine-tuning TM. In this paper, we attempt to describe the noisy batch gradient sequences on TM through the cumulative process of SD, and we use this design to dynamically assess the importance scores of weights. We introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise, and therefore tends to preserve them. We conduct extensive experiments on various TM in the natural language, question-answering, and image classification domains to validate the effectiveness of SEVEN. The results demonstrate significant improvements from SEVEN in multiple pruning scenarios and across different sparsity levels. Additionally, SEVEN exhibits robust performance under various fine-tuning strategies. The code is publicly available at https://github.com/xiaojinying/SEVEN.
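The intuition behind favoring weights with small gradient noise can be illustrated with a toy example. The sketch below is not the SEVEN algorithm and does not implement Symbolic Descent; it is a minimal illustration under stated assumptions. It assumes a first-order sensitivity of the form weight times gradient (a common pruning criterion that the abstract does not specify), and the function names `accumulated_scores` and `prune_mask` are hypothetical. It shows only that accumulating signed per-batch sensitivities lets noise of alternating sign cancel, so a weight with a consistent gradient can outrank one with a larger but noisy gradient.

```python
import numpy as np

def accumulated_scores(weights, batch_grads):
    """Score each weight by the magnitude of its averaged signed sensitivity.

    weights:     array of shape (n,)
    batch_grads: array of shape (T, n), one noisy gradient per batch
    """
    # Assumed first-order sensitivity w * g, computed per batch.
    sens = weights[None, :] * batch_grads
    # Averaging signed values lets alternating-sign noise cancel out,
    # so consistently sensitive weights keep a high score.
    return np.abs(sens.mean(axis=0))

def prune_mask(scores, sparsity):
    """Boolean mask that keeps the top (1 - sparsity) fraction of weights."""
    n_keep = int(round(scores.size * (1.0 - sparsity)))
    keep = np.argsort(scores)[::-1][:n_keep]
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    return mask

# Weight 0: small but consistent gradient. Weight 1: large but noisy gradient.
weights = np.array([1.0, 1.0])
batch_grads = np.array([[0.5,  2.0],
                        [0.5, -2.0],
                        [0.5,  2.0],
                        [0.5, -2.0]])

scores = accumulated_scores(weights, batch_grads)
mask = prune_mask(scores, sparsity=0.5)
print(scores)  # [0.5 0. ]
print(mask)    # [ True False]
```

A criterion that averaged the magnitude of each per-batch sensitivity instead would score the noisy weight at 2.0 and keep it, which is the failure mode the abstract attributes to commonly used pruning methods.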