Natural language processing (NLP) has achieved great success through the Transformer model. However, such models contain hundreds of millions or even billions of parameters, which places a heavy burden on deployment to personal computers or small-scale servers. To address this, one typically either makes the model's weight matrices sparser or compresses the attention layers. Pattern pruning, one of the most important pruning methods, selects a fixed number of parameters within each divided pattern block and prunes them. However, the effectiveness of pattern pruning is strictly limited by the sparsity within each region of weights in a layer. In this paper, we first introduce an Alternating Direction Method of Multipliers (ADMM)-based pattern pruning framework to reshape the distribution of the activation maps. Specifically, we propose to formulate pattern pruning on the Transformer as a constrained optimization problem and solve it with ADMM. In this way, the initially dense feature maps are transformed into regionally sparsified ones, and we can then achieve a higher compression ratio with better performance under pattern pruning. Additionally, this paper provides a theoretical derivation of ADMM with local sparsity. Finally, we extend the proposed ADMM-based framework to quantization to demonstrate its generality, and use SR-STE to avoid the vanishing-gradient problem. We conduct extensive experiments on classification tasks over the GLUE benchmark. Notably, we achieve a 50% compression ratio while maintaining an overall score of 80.1 on GLUE.
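To make the ADMM formulation concrete, the following is a minimal NumPy sketch of ADMM-based pattern pruning on a single weight matrix, not the paper's actual implementation: the loss is minimized over `W` while an auxiliary variable `Z` is projected onto the pattern-sparsity set (keep the `keep` largest-magnitude weights in each block of `block` consecutive weights), with a scaled dual variable `U` coupling the two. All names, block sizes, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def project_pattern(W, block=4, keep=2):
    """Euclidean projection onto the pattern-sparsity set: within each
    block of `block` consecutive weights, keep only the `keep` entries
    of largest magnitude and zero the rest."""
    flat = W.flatten()
    pad = (-flat.size) % block                      # pad so size divides evenly
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    out = np.zeros_like(blocks)
    idx = np.argsort(-np.abs(blocks), axis=1)[:, :keep]
    rows = np.arange(blocks.shape[0])[:, None]
    out[rows, idx] = blocks[rows, idx]              # retain top-`keep` per block
    return out.flatten()[:W.size].reshape(W.shape)

def admm_prune(W0, grad_fn, rho=1e-2, lr=1e-2, steps=100, block=4, keep=2):
    """Alternate the ADMM primal, projection, and dual updates.
    `grad_fn(W)` returns the gradient of the task loss at W (hypothetical)."""
    W = W0.copy()
    Z = project_pattern(W, block, keep)
    U = np.zeros_like(W)
    for _ in range(steps):
        # W-step: loss gradient plus a penalty pulling W toward Z - U
        W -= lr * (grad_fn(W) + rho * (W - Z + U))
        # Z-step: projection of W + U onto the pattern constraint
        Z = project_pattern(W + U, block, keep)
        # dual update accumulates the constraint residual
        U += W - Z
    # final hard projection so the returned weights satisfy the pattern
    return project_pattern(W, block, keep)
```

In practice the W-step would be an SGD/Adam step on the full network loss over minibatches; the quadratic penalty is what gradually reshapes the weight distribution so that the final projection loses little accuracy.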