Auto-regressive large language models such as GPT-3 require enormous computational resources to run. Structured pruning methods have traditionally been employed to reduce this resource usage; however, their application to, and efficacy for, generative language models remain largely unexplored. In this paper, we conduct a comprehensive evaluation of common structured pruning methods, including magnitude, random, and movement pruning, on the feed-forward layers of GPT-type models. Unexpectedly, random pruning achieves performance comparable to the best established methods across multiple natural language generation tasks. To understand these results, we provide a framework for measuring the neuron-level redundancy of models pruned by different methods, and find that established structured pruning methods do not account for the distinctiveness of neurons, leaving behind excess redundancy. In view of this, we introduce Globally Unique Movement (GUM) to improve the uniqueness of neurons in pruned models. We then discuss the effects of our techniques on different redundancy metrics to explain the improved performance.
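To make "structured pruning of feed-forward layers" and "neuron-level redundancy" concrete, the following is a minimal, illustrative sketch in plain Python; it is not the paper's method, framework, or GUM. It assumes a feed-forward block with an input matrix `w_in` (one row per neuron) and an output matrix `w_out` (one column per neuron), so removing a neuron means dropping a row of `w_in` and the matching column of `w_out`. The magnitude criterion (L2 norm of a neuron's input weights) and the redundancy proxy (largest absolute cosine similarity between kept neurons) are assumptions chosen for illustration, and every function name is hypothetical. Movement pruning is omitted because it scores weights using gradients collected during fine-tuning.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]  # hypothetical layout: w_in has one row per FFN neuron


def magnitude_scores(w_in: Matrix) -> List[float]:
    """Assumed criterion: score each neuron by the L2 norm of its input weights."""
    return [math.sqrt(sum(w * w for w in row)) for row in w_in]


def random_scores(w_in: Matrix, seed: int = 0) -> List[float]:
    """Random baseline: each neuron gets an independent uniform score in [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in w_in]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return the sorted indices of the k highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def prune_ffn(w_in: Matrix, w_out: Matrix, kept: List[int]) -> Tuple[Matrix, Matrix]:
    """Structured pruning: keep whole neurons (rows of w_in, columns of w_out)."""
    new_in = [w_in[i][:] for i in kept]
    new_out = [[row[i] for i in kept] for row in w_out]
    return new_in, new_out


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def max_pairwise_similarity(w_in: Matrix, kept: List[int]) -> float:
    """Hypothetical redundancy proxy: largest |cosine| between two kept neurons."""
    best = 0.0
    for a in range(len(kept)):
        for b in range(a + 1, len(kept)):
            best = max(best, abs(cosine(w_in[kept[a]], w_in[kept[b]])))
    return best


if __name__ == "__main__":
    w_in = [[2.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 0.1]]  # 4 neurons, d_model = 2
    w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]      # d_model x d_ff
    kept = keep_top_k(magnitude_scores(w_in), k=2)
    print(kept)  # [0, 1]
    print(round(max_pairwise_similarity(w_in, kept), 4))  # 0.9988
    print(prune_ffn(w_in, w_out, kept))
```

In this toy example, magnitude scoring keeps the two largest neurons even though their input weights point in almost the same direction, which is the kind of leftover redundancy a uniqueness-aware criterion would aim to avoid. Swapping in `random_scores` shows how the random baseline selects neurons with no regard to either size or distinctiveness.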