Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only a subset of the parameters for inference. However, existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting, overlooking the potential for further amplifying this inherent sparsity. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs to learn to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs like LLaMA that use non-ReLU activations. Extensive evaluations on language understanding, language generation, and instruction tuning tasks show that LTE consistently outperforms SOTA baselines. Together with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity.