Natural language processing (NLP) has achieved great success through the Transformer model. However, such models have hundreds of millions or even billions of parameters, which is a huge burden for deployment on personal computers or small-scale servers. To address this, we can either make the model's weight matrices sparser or compress the attention layers. Pattern pruning, one of the most important pruning methods, selects a fixed number of parameters in each divided pattern block and prunes the rest. However, the effect of pattern pruning is strictly limited by the sparsity within each region of weights in each layer. In this paper, we first introduce an Alternating Direction Method of Multipliers (ADMM) based pattern pruning framework to reshape the distribution of the activation map. Specifically, we propose to formulate pattern pruning on the Transformer as a constrained optimization problem and use ADMM to solve it. In this way, the initially dense feature maps are transformed into regionally sparsified ones. We can therefore achieve a higher compression ratio with better performance based on the pattern pruning method. Additionally, this paper provides a theoretical derivation of ADMM with local sparsity. Finally, we extend the proposed ADMM-based framework to quantization to demonstrate its generality, and we use SR-STE to avoid the gradient vanishing problem. We conduct extensive experiments on classification tasks over the GLUE datasets. Significantly, we achieve a 50% compression ratio while maintaining 55.4% Matthews correlation on CoLA, 68.8% accuracy on RTE, and an overall score of 80.1. Our framework also performs well on the other GLUE tasks.
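The ADMM formulation above can be sketched as follows: the training loss is minimized subject to the constraint that the weights lie in a pattern-sparse set, and ADMM alternates between a gradient step on the augmented loss, a Euclidean projection onto the sparse set, and a dual update. This is a minimal NumPy illustration under toy assumptions (a quadratic surrogate loss, 4-wide pattern blocks keeping 2 weights each; all function names are hypothetical, not from the paper's code):

```python
import numpy as np

def project_pattern_sparse(W, block=4, keep=2):
    """Project W onto the pattern-sparsity set: within every `block`
    consecutive weights of each row, keep only the `keep` largest magnitudes."""
    Z = W.copy()
    rows, cols = Z.shape
    for r in range(rows):
        for c in range(0, cols, block):
            seg = Z[r, c:c + block]          # view into Z, edits propagate
            if keep < len(seg):
                drop = np.argsort(np.abs(seg))[:-keep]
                seg[drop] = 0.0              # zero all but the top-`keep`
    return Z

def admm_prune(W0, grad_loss, lr=0.05, rho=1.0, steps=200):
    """ADMM loop for: min_W loss(W) + indicator(Z in sparse set), s.t. W = Z."""
    W, U = W0.copy(), np.zeros_like(W0)
    Z = project_pattern_sparse(W)
    for _ in range(steps):
        # W-step: gradient descent on loss(W) + (rho/2)||W - Z + U||^2
        W -= lr * (grad_loss(W) + rho * (W - Z + U))
        # Z-step: projection onto the pattern-sparse constraint set
        Z = project_pattern_sparse(W + U)
        # dual update
        U += W - Z
    return project_pattern_sparse(W)         # hard-prune at the end

# Toy quadratic loss ||W - T||^2 pulling toward a dense target T.
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 8))
W = admm_prune(rng.normal(size=(4, 8)), lambda W: 2 * (W - T))
```

In a real Transformer, the W-step would be several SGD/Adam steps on the task loss, and the projection would enforce the chosen per-block pattern; the alternating structure is what reshapes the weight distribution toward regional sparsity before the final hard prune.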