Sparse computation offers a compelling solution for the inference of Large Language Models (LLMs) in low-resource scenarios by dynamically skipping the computation of inactive neurons. While traditional approaches focus on ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of sparse LLMs beyond zero activation values. We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, we propose a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing ReLU$^2$ excel across all three evaluation aspects, highlighting its potential as an efficient activation function for sparse LLMs. We will release the code to facilitate future research.
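The magnitude-based definition of neuron activation described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the layer is a plain (non-gated) feed-forward block with random weights, the helper names (`neuron_outputs`, `magnitude_sparsity`, `sparse_ffn`) are hypothetical, and the threshold is picked as a simple quantile of the output magnitudes, whereas the abstract only says the threshold is "tailored" without specifying how.

```python
import numpy as np


def relu2(x):
    """ReLU squared: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2


def silu(x):
    """SiLU (swish), the smooth activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))


def neuron_outputs(x, w_in, act):
    """Intermediate neuron outputs for inputs x of shape (tokens, d_model)."""
    return act(x @ w_in)


def magnitude_sparsity(h, threshold):
    """Fraction of neuron outputs whose magnitude is at most `threshold`."""
    return float(np.mean(np.abs(h) <= threshold))


def sparse_ffn(x, w_in, w_out, act, threshold):
    """Feed-forward pass that drops neurons treated as inactive.

    A neuron counts as active for a token only if |output| > threshold.
    Here the skipped work is simulated by zeroing those outputs; a real
    sparse kernel would avoid computing them at all.
    """
    h = neuron_outputs(x, w_in, act)
    mask = np.abs(h) > threshold
    return (h * mask) @ w_out, mask


# Toy dimensions and random weights (assumed, illustration only).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w_in = rng.standard_normal((16, 64))
w_out = rng.standard_normal((64, 16))

for name, act in [("relu2", relu2), ("silu", silu)]:
    h = neuron_outputs(x, w_in, act)
    dense = h @ w_out
    # Assumed threshold rule: the 80th percentile of output magnitudes.
    threshold = np.quantile(np.abs(h), 0.8)
    approx, _ = sparse_ffn(x, w_in, w_out, act, threshold)
    rel_err = np.linalg.norm(approx - dense) / np.linalg.norm(dense)
    print(name,
          "zero-based sparsity:", magnitude_sparsity(h, 0.0),
          "thresholded sparsity:", magnitude_sparsity(h, threshold),
          "relative output error:", rel_err)
```

With a threshold of zero the sketch reduces to the traditional ReLU-style notion of sparsity, under which a smooth activation such as SiLU shows almost none. Raising the threshold trades output error for sparsity, which corresponds to the sparsity versus performance trade-off named in the abstract.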