Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient properties of Spiking Neural Networks (SNNs), have achieved impressive results on neuromorphic and vision tasks and are attracting increasing attention. However, existing directly trained spiking transformers focus primarily on vision tasks. For language modeling with spiking transformers, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Building on these modules, we design the WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and the WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spike-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question answering, and commonsense reasoning validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.