Natural language processing (NLP) has achieved great success through the transformer model. However, such models have hundreds of millions or even billions of parameters, which imposes a heavy burden on deployment on personal computers or small-scale servers. To address this, one can either sparsify the model's weight matrices or compress the attention layers. Pattern pruning, one of the most important pruning methods, selects a fixed number of parameters within each divided pattern block and prunes them. However, the effectiveness of pattern pruning is strictly limited by the sparsity within each regional block of weights in every layer. In this paper, we are the first to introduce an Alternating Direction Method of Multipliers (ADMM) based pattern pruning framework to reshape the distribution of activation maps. Specifically, we formulate pattern pruning on the transformer as a constrained optimization problem and solve it with ADMM. In this way, the initially dense feature maps are transformed into regionally sparsified ones, allowing pattern pruning to reach a higher compression ratio with better performance. Additionally, this paper provides a theoretical derivation of ADMM with local sparsity. Finally, we extend the proposed ADMM-based framework with SR-STE to demonstrate its generality and to avoid the vanishing-gradient problem. We conduct extensive experiments on classification tasks over the GLUE benchmark. Notably, we achieve a 50% compression ratio while maintaining an overall score of 80.1 on GLUE.
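To make the constrained formulation concrete, the following is a minimal sketch of how such an ADMM pruning problem is typically set up; the symbols $f$, $W$, $Z$, $U$, $S$, and $\rho$ are illustrative notation, not taken from the paper. Letting $f(W)$ denote the training loss and $S$ the set of weight tensors satisfying the per-block pattern-sparsity constraint, the pruning problem can be written as

\[
\min_{W} \; f(W) \quad \text{s.t.} \quad W \in S .
\]

Introducing an auxiliary variable $Z$ and the indicator function $g(Z)$ of $S$, ADMM in scaled form (with penalty parameter $\rho$ and dual variable $U$) alternates the updates

\begin{align}
W^{k+1} &= \arg\min_{W} \; f(W) + \tfrac{\rho}{2}\, \| W - Z^{k} + U^{k} \|_F^2 , \\
Z^{k+1} &= \Pi_S\!\left( W^{k+1} + U^{k} \right) , \\
U^{k+1} &= U^{k} + W^{k+1} - Z^{k+1} ,
\end{align}

where the projection $\Pi_S$ keeps the largest-magnitude entries within each pattern block and zeroes out the rest, so the dense weights are gradually driven toward a regionally sparse solution.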