As large language models continue to scale, so do the computational resources required to run them. Spiking neural networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse, event-driven activations to reduce the computational overhead of model inference. While they have become competitive with non-spiking models on many computer vision tasks, SNNs have also proven more challenging to train. As a result, their performance lags behind modern deep learning, and we have yet to see the effectiveness of SNNs in language generation. In this paper, we implement 'SpikeGPT', a generative language model with purely binary, event-driven spiking activation units. We train three variants of the proposed model: 45M, 125M, and 260M parameters. To the best of our knowledge, this is 4x larger than any functional backprop-trained SNN to date. We achieve this by modifying the transformer block, replacing multi-head self-attention with a mechanism whose computational complexity grows linearly rather than quadratically with sequence length. Input tokens are instead streamed sequentially into our attention mechanism (as with typical SNNs). Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on the tested benchmarks while consuming 5x less energy when run on neuromorphic hardware that can exploit sparse, event-driven activations. Our code is available at https://github.com/ridgerchu/SpikeGPT.
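To make "binary, event-driven spiking activation units" and sequential streaming concrete, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron, a common SNN unit. It is an illustration under stated assumptions, not the SpikeGPT implementation: the function name `lif_spikes`, the decay factor `beta`, the `threshold`, and the soft-reset rule are all hypothetical choices made for this example, and training is not covered.

```python
from typing import List


def lif_spikes(inputs: List[float], beta: float = 0.9, threshold: float = 1.0) -> List[int]:
    """Stream inputs one step at a time through a single LIF neuron.

    Hypothetical illustration: `beta` (membrane decay) and `threshold`
    are assumed values, not parameters taken from the paper.
    """
    membrane = 0.0          # membrane potential carried across time steps
    spikes: List[int] = []
    for current in inputs:  # inputs arrive sequentially, one per step
        membrane = beta * membrane + current  # leak, then integrate
        if membrane >= threshold:
            spikes.append(1)        # binary event: the neuron fires
            membrane -= threshold   # soft reset (an assumed reset rule)
        else:
            spikes.append(0)        # no event: nothing to propagate
    return spikes


example = lif_spikes([0.5, 0.5, 0.5, 0.0, 1.2])
print(example)  # [0, 0, 1, 0, 1]
```

The output is a sequence of 0s and 1s, so downstream work is only needed at the steps where a spike occurs. This sparsity is the property the abstract says neuromorphic hardware can exploit.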