Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrated impressive performance. However, existing Transformer-based SNNs suffer from two fundamental limitations. First, they typically use max pooling layers to reduce the size of feature maps, but max pooling captures only the strongest response and fails to preserve representative regional features comprehensively. Second, global self-attention computes interactions between all pairs of features, which introduces redundant computation and quadratic complexity and thus conflicts with the sparse, energy-efficient nature of SNNs. To address these challenges, we develop the Local Structure-Aware Spiking Transformer (LSFormer), a novel Transformer-based SNN that incorporates Spiking Response Pooling (SPooling) and Local Structure-Aware Spiking Self-Attention (LS-SSA). For the first time, LSFormer leverages a local dilated window mechanism to capture both local details and long-range dependencies. Experimental results demonstrate that LSFormer achieves state-of-the-art performance compared to existing advanced Transformer-based SNNs. Notably, on the more challenging static dataset Tiny-ImageNet and the neuromorphic dataset N-CALTECH101, LSFormer outperforms state-of-the-art baselines by 4.3\% and 8.6\% in top-1 classification accuracy, respectively. These results highlight the potential of LSFormer to advance energy-efficient spiking models toward practical deployment in large-scale vision applications.
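The contrast between global attention and a local dilated window can be made concrete with a small counting sketch. This is not the paper's LS-SSA; it only illustrates, under assumed settings, how many query-key pairs each scheme touches on an `h x w` feature map. The function names `dilated_window` and `attention_pairs`, the 3 x 3 window, and the dilation of 2 are all hypothetical choices made for illustration.

```python
def dilated_window(h, w, row, col, k=3, dilation=2):
    """In-bounds positions of a k x k dilated window centred at (row, col).

    Assumption: positions that fall outside the map are dropped (no padding).
    """
    half = k // 2
    positions = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = row + dr * dilation, col + dc * dilation
            if 0 <= r < h and 0 <= c < w:
                positions.append((r, c))
    return positions


def attention_pairs(h, w, k=3, dilation=2):
    """Count query-key pairs for global attention vs. a local dilated window."""
    n = h * w
    global_pairs = n * n  # every token attends to every token: O(N^2)
    local_pairs = sum(
        len(dilated_window(h, w, r, c, k, dilation))
        for r in range(h)
        for c in range(w)
    )  # each token attends to at most k*k neighbours: O(N * k^2)
    return global_pairs, local_pairs


if __name__ == "__main__":
    g, l = attention_pairs(8, 8)
    print(f"global pairs: {g}, local dilated pairs: {l}")
```

With these assumed settings, the dilation lets each token reach positions two steps away while still attending to at most nine keys, which is the general idea behind covering a wider receptive field at local cost.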