Recent advances in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, inference with these models demands substantial energy and computational resources, posing considerable deployment challenges. In contrast, the human brain, which contains approximately 86 billion biological neurons, is far more energy-efficient than LLMs with a comparable number of parameters. Inspired by this, we redesign 7- to 70-billion-parameter LLMs with bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose SpikeLLM, the first spiking large language model at the scale of recent LLMs. Coupled with the proposed model, we introduce a novel spike-driven quantization framework, Optimal Brain Spiking, which reduces energy cost and accelerates inference through two key techniques: first- (second-) order differentiation-based salient channel detection, and per-channel salient outlier expansion with Generalized Integrate-and-Fire (GIF) neurons. The proposed spike-driven quantization plugs into mainstream quantization training pipelines. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 25.51% and improves average accuracy on six zero-shot datasets by 3.08% for a LLAMA2-7B 4A4W model. In the GPTQ pipeline, SpikeLLM realizes sparse ternary quantization, which achieves additive operations in all linear layers. Compared with PB-LLM under a similar operation budget, SpikeLLM again shows significant gains. We will release our code on GitHub.
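To make the salient channel detection concrete, the sketch below ranks weight channels by an approximate loss impact: a first-order proxy using gradients, or a second-order, OBS-style score using a diagonal Hessian approximation. This is a minimal, hypothetical simplification for illustration; the function name `salient_channels` and the exact scoring formulas are assumptions, not the paper's published criterion.

```python
import numpy as np

def salient_channels(W, H_diag, top_k, order=2):
    """Rank weight channels (columns of W) by an approximate loss impact.

    order=1: first-order proxy |W * grad| averaged per channel
             (here H_diag is reused to hold per-element gradients).
    order=2: OBS-style score W^2 * diag(H) averaged per channel.

    Hypothetical simplification of a salient-channel criterion,
    not the exact Optimal Brain Spiking formulation.
    """
    if order == 1:
        scores = np.abs(W * H_diag).mean(axis=0)
    else:
        scores = (W ** 2 * H_diag).mean(axis=0)
    # Indices of the top_k most salient channels, highest score first.
    return np.argsort(scores)[::-1][:top_k]

# Example: channel 0 carries much larger weights, so it is flagged salient.
W = np.array([[1.0, 0.1],
              [2.0, 0.2]])
idx = salient_channels(W, np.ones_like(W), top_k=1)
```

Channels flagged this way would then be expanded with extra spiking timesteps rather than clipped by a uniform quantizer.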
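The outlier-expansion idea rests on Integrate-and-Fire dynamics: a neuron integrates its input over several timesteps and emits a spike each time a threshold is crossed, so the spike count encodes the value additively. The toy model below illustrates this coding scheme; it is a minimal sketch under simplifying assumptions (scalar input, soft reset, fixed threshold), and `gif_quantize` is a hypothetical name, not the paper's API.

```python
def gif_quantize(x, n_steps=4, threshold=1.0):
    """Encode a value as spike counts from a simplified
    (Generalized) Integrate-and-Fire neuron over n_steps timesteps.

    A toy illustration: the membrane repeatedly integrates x and fires
    whenever it crosses the threshold, with a soft reset (subtract
    threshold). The spike count, rescaled, approximates x.
    """
    mem, spikes = 0.0, 0
    for _ in range(n_steps):
        mem += x                  # integrate the input this timestep
        if mem >= threshold:      # fire when the threshold is crossed
            spikes += 1
            mem -= threshold      # soft reset keeps the residual charge
    return spikes * threshold / n_steps  # decode spike count to a value

# x = 0.5 with 4 steps fires twice, reconstructing 0.5 exactly;
# x = 0.3 fires once, giving the coarser approximation 0.25.
```

Because salient channels can be given more timesteps (more spikes), they retain more precision than a uniform low-bit quantizer would allow, while the arithmetic stays accumulation-only.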