Traditional pruning methods are difficult to apply to Large Language Models (LLMs) for generative AI because of their prohibitively expensive training process and large computational demands. For the first time, we introduce the information entropy of hidden-state features into the design of a pruning metric, named E-Sparse, to improve the accuracy of N:M sparsity on LLMs. E-Sparse uses information richness to gauge channel importance and incorporates several novel techniques to put this into practice: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms, yielding a novel pruning metric, and performs N:M sparsity without modifying the remaining weights; (2) it designs a global naive shuffle and a local block shuffle to quickly optimize the information distribution and to mitigate the impact of N:M sparsity on LLM accuracy. E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse significantly speeds up inference over the dense model (up to 1.53×) and yields significant memory savings (up to 43.52%), with acceptable accuracy loss.
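To make the idea of an entropy-enhanced N:M pruning metric concrete, here is a minimal NumPy sketch. The abstract does not give the exact formula, so everything specific here is an assumption for illustration: the score `|W| * (entropy + alpha * norm)`, the histogram-based entropy estimate, the weighting `alpha`, and the function names `channel_entropy`, `nm_mask`, and `prune_nm`. It omits the channel shuffles and is not the authors' implementation.

```python
import numpy as np


def channel_entropy(x, bins=16):
    """Histogram-based entropy (in nats) of each input channel.

    x has shape (tokens, in_features). The histogram estimator and the
    bin count are assumptions made for this sketch.
    """
    ent = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        counts, _ = np.histogram(x[:, j], bins=bins)
        p = counts[counts > 0] / counts.sum()
        ent[j] = -(p * np.log(p)).sum()
    return ent


def nm_mask(score, n=2, m=4):
    """Keep the n highest-scoring entries in each group of m consecutive inputs."""
    out_f, in_f = score.shape
    assert in_f % m == 0, "in_features must be divisible by m"
    groups = score.reshape(out_f, in_f // m, m)
    # argsort is ascending, so the last n indices are the top-n per group.
    top = np.argsort(groups, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(out_f, in_f)


def prune_nm(weight, x, n=2, m=4, alpha=1.0):
    """Zero out weights by an assumed entropy-plus-norm score.

    weight: (out_features, in_features); x: (tokens, in_features).
    Surviving weights are left unchanged, as the abstract describes.
    """
    entropy = channel_entropy(x)           # information richness per channel
    norm = np.linalg.norm(x, axis=0)       # L2 norm of each input channel
    score = np.abs(weight) * (entropy + alpha * norm)  # assumed combination
    mask = nm_mask(score, n, m)
    return weight * mask, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16))
    acts = rng.normal(size=(64, 16))
    pruned, mask = prune_nm(w, acts)
    print("kept fraction:", mask.mean())
```

With `n=2, m=4` this produces the 2:4 pattern that sparse tensor cores on NVIDIA Ampere GPUs accelerate. Because the mask only zeroes entries, no retraining or weight update is involved.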