Activation sparsity refers to the presence of a considerable number of weakly contributing elements among activation outputs. As a prevalent property of models that use the ReLU activation function, it has proven to be a promising paradigm for boosting model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored substituting ReLU or its variants as the activation function to help LLMs achieve activation sparsity and inference acceleration, but few achieve both high sparsity and comparable model performance. This paper introduces an effective sparsification method named "ProSparse" that pushes LLMs toward higher activation sparsity without decreasing model performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse applies progressive sparsity regularization whose factor increases smoothly along sine curves over multiple stages. By avoiding radical shifts in the activation distribution, this enhances activation sparsity while alleviating performance degradation. With ProSparse, we obtain high sparsity of 89.32% and 88.80% for LLaMA2-7B and LLaMA2-13B, respectively, while achieving performance comparable to their original Swish-activated versions. Our inference acceleration experiments further demonstrate the practical speedup that higher activation sparsity brings.
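To make the idea of a multi-stage, sine-shaped regularization factor concrete, here is a minimal pure-Python sketch. It assumes that within each stage the factor rises from the previous stage's target to the current one along a quarter period of a sine wave (0 to π/2), that the penalty is the mean L1 norm of the ReLU outputs, and that sparsity is measured as the fraction of exactly-zero activations. The function names, stage boundaries, and target values are hypothetical illustrations, not the paper's exact formulation or hyperparameters.

```python
import math
from typing import Sequence


def sine_stage_lambda(step: int, stage_ends: Sequence[int],
                      stage_targets: Sequence[float]) -> float:
    """Hypothetical multi-stage sine schedule for the regularization factor.

    Stage i covers steps (stage_ends[i-1], stage_ends[i]]; within it the factor
    rises from the previous target (0.0 before the first stage) to
    stage_targets[i] along sin(pi/2 * progress). After the last stage the
    final target is held constant. Assumes step >= 0 and strictly
    increasing stage_ends.
    """
    if len(stage_ends) != len(stage_targets):
        raise ValueError("stage_ends and stage_targets must have equal length")
    prev_end, prev_target = 0, 0.0
    for end, target in zip(stage_ends, stage_targets):
        if step <= end:
            progress = (step - prev_end) / (end - prev_end)
            return prev_target + (target - prev_target) * math.sin(math.pi / 2 * progress)
        prev_end, prev_target = end, target
    return prev_target


def relu(xs: Sequence[float]) -> list[float]:
    """ReLU substituted for the original (e.g. Swish) activation."""
    return [max(0.0, x) for x in xs]


def activation_sparsity(acts: Sequence[float]) -> float:
    """Fraction of activation outputs that are exactly zero."""
    return sum(1 for a in acts if a == 0.0) / len(acts)


def regularized_loss(task_loss: float, acts: Sequence[float], lam: float) -> float:
    """Task loss plus lam times the mean L1 norm of the activations (assumed form)."""
    l1 = sum(abs(a) for a in acts) / len(acts)
    return task_loss + lam * l1
```

A short usage example with two hypothetical stages ending at steps 100 and 200:

```python
ends, targets = [100, 200], [0.01, 0.05]
for t in (0, 50, 100, 150, 200, 250):
    print(t, round(sine_stage_lambda(t, ends, targets), 6))

acts = relu([-1.0, 2.0, -0.5, 0.0])
print(activation_sparsity(acts))            # 3 of 4 outputs are zero
print(regularized_loss(1.0, acts, lam=0.01))
```

Because each quarter-sine segment ends with zero slope, the factor levels off at every stage boundary before the next stage raises it again, which is one way to avoid abrupt jumps in the penalty and hence abrupt shifts in the activation distribution.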