Leveraging sparsity is crucial for optimizing large language model inference. However, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity, and has shown through fine-tuning that this causes no downstream task accuracy degradation. However, taking full advantage of this sparsity has required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free activation sparsity predictor for ReLUfied LLMs, which predicts activation sparsity by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be tuned adaptively, which also serves as a control knob for optimizing LLM inference. The proposed method achieves faster inference than the state of the art, with negligible accuracy loss of within 1%p.
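The sign-bit comparison described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: for a ReLU layer, an output neuron is inactive when the dot product of its weight row with the input is non-positive, so counting how many elementwise products are positive (sign bits of weight and input agree) gives a cheap proxy for whether the neuron will fire. The `threshold` parameter and the simple count-based aggregation are assumptions made for this sketch.

```python
import numpy as np

def predict_active(weight_neg_signs, x, threshold):
    """Predict which ReLU output neurons are active, using only sign bits.

    weight_neg_signs: bool array (out_dim, in_dim), precomputed once as W < 0.
    x: input vector (in_dim,).
    threshold: minimum number of positive elementwise products required
               to predict a neuron as active (a hypothetical knob for
               this sketch; lowering it makes the prediction more
               conservative, i.e. more neurons are predicted active).
    """
    x_neg_signs = x < 0
    # XOR of sign bits is True exactly where the elementwise product
    # of weight and input would be negative.
    neg_products = weight_neg_signs ^ x_neg_signs          # (out, in)
    pos_counts = (~neg_products).sum(axis=1)               # per-neuron count
    return pos_counts >= threshold

# Tiny example: row 0 has all-positive products (predicted active),
# row 1 has all-negative products (predicted inactive, ReLU output 0).
W = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
pred = predict_active(W < 0, x, threshold=1)
```

Because only sign bits are compared, the predictor needs no training and the weight sign matrix can be packed into bitmaps and evaluated with XOR and popcount, which is what makes the approach lightweight; the adaptive conservativeness mentioned in the abstract would correspond to adjusting `threshold` per layer.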