Traditional pruning methods are difficult to apply to Large Language Models (LLMs) for generative AI because of their unaffordable training cost and large computational demands. For the first time, we introduce the information entropy of hidden-state features into pruning-metric design, proposing E-Sparse to improve the accuracy of N:M sparsity on LLMs. E-Sparse uses information richness to measure channel importance, and further incorporates several novel techniques to put it into effect: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms as a novel pruning metric, and performs N:M sparsity without modifying the remaining weights; (2) it designs a global naive shuffle and a local block shuffle to quickly optimize the information distribution and mitigate the impact of N:M sparsity on LLM accuracy. E-Sparse is implemented as a Sparse-GEMM in FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse significantly speeds up model inference over the dense baseline (up to 1.53×) and yields significant memory savings (up to 43.52%), with acceptable accuracy loss.
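The entropy-enhanced pruning idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`channel_entropy`, `e_sparse_mask`) and the exact way the metric combines weight magnitude, input-channel norm, and channel entropy are assumptions; the paper's precise formulation appears in the method section. The sketch does capture the two properties stated in the abstract: the mask is chosen by a one-shot metric, and the surviving weights are left unmodified (no retraining or weight update).

```python
import numpy as np

def channel_entropy(X, num_bins=32):
    """Per-channel information entropy of hidden-state features.

    X: (tokens, in_features) activation sample.
    Returns an (in_features,) entropy vector. Hypothetical helper;
    binning scheme is an assumption.
    """
    ent = np.empty(X.shape[1])
    for c in range(X.shape[1]):
        hist, _ = np.histogram(X[:, c], bins=num_bins)
        p = hist / hist.sum()
        p = p[p > 0]
        ent[c] = -np.sum(p * np.log2(p))
    return ent

def e_sparse_mask(W, X, n=2, m=4):
    """Build an N:M (here 2:4) sparsity mask from an entropy-enhanced metric.

    Assumed metric form: |W| scaled by (entropy + norm) of each input
    channel. Within every group of m consecutive weights along the
    input dimension, the n highest-scoring weights are kept.
    """
    ent = channel_entropy(X)                 # (in_features,)
    act_norm = np.linalg.norm(X, axis=0)     # (in_features,)
    metric = np.abs(W) * (ent + act_norm)    # broadcast over output rows

    out_f, in_f = W.shape
    groups = metric.reshape(out_f, in_f // m, m)
    # Indices of the (m - n) lowest-scoring entries in each group get pruned.
    prune_idx = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=-1)
    return mask.reshape(out_f, in_f)

# Usage: remaining weights are masked, not updated.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(64, 16))
mask = e_sparse_mask(W, X)
W_sparse = W * mask
```

On Ampere GPUs, a 2:4 pattern like this is what the Sparse Tensor Cores accelerate, which is why the mask is constrained per group of four rather than chosen globally.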