Activation sparsity refers to the existence of considerable weakly-contributing elements within activation outputs that can be eliminated, benefiting many important applications of large language models (LLMs). Although promoting greater activation sparsity within LLMs merits deeper study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and its potentially influential factors. In this paper, we present a comprehensive study of the quantitative scaling properties and influential factors of activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. First, different activation functions exhibit comparable performance but opposite training-time sparsity trends: the activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power law with the amount of training data for SiLU-activated LLMs, and as a decreasing logspace power law for ReLU-activated LLMs. This demonstrates that ReLU is more efficient than SiLU as an activation function and can leverage more training data to improve activation sparsity. Second, the activation ratio increases linearly with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale; that is, the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws, which point toward LLMs with greater activation sparsity, have important implications for making LLMs more efficient and interpretable.
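To make the PPL-$p\%$ idea concrete, below is a minimal sketch of one plausible way to compute such a performance-aware sparsity level: bisect for the largest magnitude threshold under which zeroing weakly-contributing activations keeps perplexity within $p\%$ of the baseline, then measure the resulting activation ratio. The function names (`activation_ratio`, `ppl_p_threshold`), the `eval_ppl` callback, and the bisection bounds are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def activation_ratio(hidden: torch.Tensor, threshold: float) -> float:
    """Fraction of activation entries whose magnitude exceeds `threshold`.

    `hidden` is an intermediate FFN activation output, e.g. of shape
    (batch, seq_len, d_ff). Activation sparsity is 1 - activation_ratio.
    """
    return (hidden.abs() > threshold).float().mean().item()

def ppl_p_threshold(eval_ppl, base_ppl: float, p: float,
                    lo: float = 0.0, hi: float = 1.0, iters: int = 20) -> float:
    """Bisect for the largest truncation threshold whose perplexity stays
    within (1 + p) * base_ppl.

    `eval_ppl(t)` is an assumed callback that returns the model's perplexity
    when all activations with |x| <= t are zeroed; `hi` is an assumed upper
    bound on useful thresholds.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_ppl(mid) <= base_ppl * (1 + p):
            lo = mid  # still within tolerance: try a larger threshold
        else:
            hi = mid  # too lossy: shrink the threshold
    return lo
```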
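The two training-time trends can likewise be summarized as fittable curves. The sketch below fits assumed parameterizations of a convergent increasing power law (SiLU) and a decreasing logspace power law (ReLU) to measured activation ratios; the exact functional forms in the paper may differ, and the data arrays are placeholders to be replaced with measurements from training checkpoints.

```python
import numpy as np
from scipy.optimize import curve_fit

def silu_trend(D, A_lim, c, alpha):
    """Assumed convergent increasing power law: the activation ratio rises
    toward a limit A_lim as the amount of training data D grows."""
    return A_lim - c * D ** (-alpha)

def relu_trend(D, A_lim, c, alpha):
    """Assumed decreasing logspace power law: the activation ratio decays
    toward A_lim as a power law in log(D), i.e. more data -> more sparsity."""
    return A_lim + c * np.log(D) ** (-alpha)

# Placeholder values for illustration only (not results from the paper):
# D holds training-token counts, A the measured activation ratios.
D = np.array([1e9, 5e9, 2e10, 1e11])
A = np.array([0.32, 0.30, 0.29, 0.285])

params, _ = curve_fit(relu_trend, D, A, p0=[0.25, 0.5, 1.0], maxfev=10000)
print("fitted (A_lim, c, alpha):", params)
```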