Large Language Models (LLMs) have achieved remarkable success with their billions of parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural way to reduce this cost, since only a subset of the parameters needs to be involved in inference. Existing methods focus only on exploiting this naturally formed activation sparsity and overlook the potential to amplify it further. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To this end, we introduce a novel algorithm, Learn-To-be-Efficient (LTE), which trains efficiency-aware LLMs to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike state-of-the-art MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs such as GPT and LLaMA that use soft activation functions. We evaluate LTE on four models and eleven datasets. The experiments show that LTE achieves a better trade-off between sparsity and task performance. For instance, LTE with LLaMA provides a 1.83x-2.59x FLOPs speed-up on language generation tasks, outperforming the state-of-the-art methods.
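To make the notion of activation sparsity concrete, the following is a minimal NumPy sketch of a ReLU feed-forward block in which only the neurons with non-zero activation are computed. It is not the LTE training algorithm: the function names, dimensions, and the "oracle" neuron selection are assumptions chosen purely for illustration. It shows only that skipping inactive neurons leaves the output unchanged while reducing the FLOPs of the two matrix products.

```python
import numpy as np


def dense_ffn(x, w_in, w_out):
    """Two-layer feed-forward block with ReLU; every neuron is computed."""
    hidden = np.maximum(x @ w_in, 0.0)
    return hidden @ w_out


def sparse_ffn(x, w_in, w_out, active):
    """Same block, restricted to the neuron indices listed in `active`."""
    hidden = np.maximum(x @ w_in[:, active], 0.0)
    return hidden @ w_out[active, :]


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
x = rng.standard_normal(d_model)
w_in = rng.standard_normal((d_model, d_ff))
w_out = rng.standard_normal((d_ff, d_model))

# Oracle selection (an assumption): we peek at the pre-activations to find
# the neurons whose ReLU output is non-zero. A practical system would need a
# cheap predictor or router to choose neurons without computing them all.
active = np.flatnonzero(x @ w_in > 0.0)

dense_out = dense_ffn(x, w_in, w_out)
sparse_out = sparse_ffn(x, w_in, w_out, active)

# Count one multiply-add as 2 FLOPs; the block has two matrix products.
flops_dense = 2 * 2 * d_model * d_ff
flops_sparse = 2 * 2 * d_model * active.size
speedup = flops_dense / flops_sparse

print("outputs match:", np.allclose(dense_out, sparse_out))
print("active neurons:", active.size, "of", d_ff)
print("FLOPs ratio (dense / sparse):", speedup)
```

With ReLU, the skipped neurons contribute exactly zero, so the sparse and dense outputs agree. Soft activation functions such as GELU or SiLU rarely produce exact zeros, so this exact-skipping trick does not carry over directly, which is the setting the abstract says LTE is designed to handle.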