Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only a subset of the parameters during inference. Existing methods focus only on utilizing this naturally formed activation sparsity, overlooking the potential to further amplify this inherent sparsity. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To achieve this, we introduce a novel algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs to learn to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike state-of-the-art MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs with soft activation functions, such as GPT and LLaMA. We evaluate LTE on four models and eleven datasets. The experiments show that LTE achieves a better trade-off between sparsity and task performance. For instance, LTE with LLaMA provides a 1.83x-2.59x FLOPs speed-up on language generation tasks, outperforming state-of-the-art methods.
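To make the notion of structured activation sparsity concrete, the sketch below illustrates the general MoEfication-style idea the abstract refers to: FFN neurons are grouped into blocks ("experts"), and a lightweight router selects only the top-k blocks per token, so only a fraction of the FFN parameters contribute to each forward pass. This is a minimal conceptual illustration, not the authors' LTE training algorithm; the class name, dimensions, and the choice of top-k routing are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparselyActivatedFFN(nn.Module):
    """Conceptual sketch of structured activation sparsity in a Transformer FFN.

    Neurons are grouped into contiguous blocks ("experts"); a small router scores
    the blocks per token and only the top-k blocks are kept. This illustrates the
    general MoEfication idea, not the LTE method itself.
    """

    def __init__(self, d_model=768, d_ff=3072, n_experts=16, top_k=4):
        super().__init__()
        assert d_ff % n_experts == 0
        self.expert_size = d_ff // n_experts
        self.top_k = top_k
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.router = nn.Linear(d_model, n_experts)  # scores each neuron block

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, n_experts)
        top_idx = scores.topk(self.top_k, dim=-1).indices
        block_mask = torch.zeros_like(scores).scatter_(-1, top_idx, 1.0)
        # Expand the block-level mask to individual FFN neurons.
        neuron_mask = block_mask.repeat_interleave(self.expert_size, dim=-1)
        # For clarity, this computes all neurons and masks the inactive ones;
        # a real inference kernel would compute only the selected blocks to
        # realize the FLOPs savings.
        h = F.gelu(self.w_in(x)) * neuron_mask
        return self.w_out(h)


# Usage: only top_k of n_experts neuron blocks contribute per token.
ffn = SparselyActivatedFFN()
out = ffn(torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 10, 768])
```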