Recent advances in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, inference with these models demands substantial energy and computational resources, posing considerable deployment challenges. In contrast, the human brain, which contains approximately 86 billion biological neurons, is far more energy-efficient than LLMs with a comparable number of parameters. Inspired by this, we redesign 7- to 70-billion-parameter LLMs with bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose SpikeLLM, the first spiking large language model at the scale of recent LLMs. Coupled with the proposed model, we introduce a novel spike-driven quantization framework, Optimal Brain Spiking, which reduces energy cost and accelerates inference through two key techniques: first- (second-) order differentiation-based salient channel detection, and per-channel salient outlier expansion with Generalized Integrate-and-Fire (GIF) neurons. The proposed spike-driven quantization plugs into mainstream quantization training pipelines. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 25.51% and improves average accuracy on six zero-shot datasets by 3.08% for a LLAMA2-7B 4A4W model. In the GPTQ pipeline, SpikeLLM realizes sparse ternary quantization, which achieves additive operations in all linear layers. Compared with PB-LLM under a similar operation budget, SpikeLLM again shows significant gains. We will release our code on GitHub.
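To make the salient channel detection concrete, the sketch below ranks weight channels by an approximate loss impact: a first-order proxy using gradients, or a second-order, OBS-style score using a diagonal Hessian approximation. This is a minimal, hypothetical simplification for illustration; the function name `salient_channels` and the exact scoring formulas are assumptions, not the paper's published criterion.

```python
import numpy as np

def salient_channels(W, H_diag, top_k, order=2):
    """Rank weight channels (columns of W) by an approximate loss impact.

    order=1: first-order proxy |W * grad| averaged per channel
             (here H_diag is reused to hold per-element gradients).
    order=2: OBS-style score W^2 * diag(H) averaged per channel.

    Hypothetical simplification of a salient-channel criterion,
    not the exact Optimal Brain Spiking formulation.
    """
    if order == 1:
        scores = np.abs(W * H_diag).mean(axis=0)
    else:
        scores = (W ** 2 * H_diag).mean(axis=0)
    # Indices of the top_k most salient channels, highest score first.
    return np.argsort(scores)[::-1][:top_k]

# Example: channel 0 carries much larger weights, so it is flagged salient.
W = np.array([[1.0, 0.1],
              [2.0, 0.2]])
idx = salient_channels(W, np.ones_like(W), top_k=1)
```

Channels flagged this way would then be expanded with extra spiking timesteps rather than clipped by a uniform quantizer.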
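The outlier-expansion idea rests on Integrate-and-Fire dynamics: a neuron integrates its input over several timesteps and emits a spike each time a threshold is crossed, so the spike count encodes the value additively. The toy model below illustrates this coding scheme; it is a minimal sketch under simplifying assumptions (scalar input, soft reset, fixed threshold), and `gif_quantize` is a hypothetical name, not the paper's API.

```python
def gif_quantize(x, n_steps=4, threshold=1.0):
    """Encode a value as spike counts from a simplified
    (Generalized) Integrate-and-Fire neuron over n_steps timesteps.

    A toy illustration: the membrane repeatedly integrates x and fires
    whenever it crosses the threshold, with a soft reset (subtract
    threshold). The spike count, rescaled, approximates x.
    """
    mem, spikes = 0.0, 0
    for _ in range(n_steps):
        mem += x                  # integrate the input this timestep
        if mem >= threshold:      # fire when the threshold is crossed
            spikes += 1
            mem -= threshold      # soft reset keeps the residual charge
    return spikes * threshold / n_steps  # decode spike count to a value

# x = 0.5 with 4 steps fires twice, reconstructing 0.5 exactly;
# x = 0.3 fires once, giving the coarser approximation 0.25.
```

Because salient channels can be given more timesteps (more spikes), they retain more precision than a uniform low-bit quantizer would allow, while the arithmetic stays accumulation-only.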