The combination of Spiking Neural Networks (SNNs) and Vision Transformers (ViTs) holds potential for achieving both energy efficiency and high performance, making it particularly suitable for edge vision applications. However, a significant performance gap remains between SNN-based ViTs and their ANN counterparts. Here, we first analyze why SNN-based ViTs suffer from limited performance and identify a mismatch between the vanilla self-attention mechanism and spatio-temporal spike trains. This mismatch results in degraded spatial relevance and limited temporal interactions. To address these issues, we draw inspiration from biological saccadic attention mechanisms and introduce an innovative Saccadic Spike Self-Attention (SSSA) method. Specifically, in the spatial domain, SSSA employs a novel spike distribution-based method to effectively assess the relevance between Query and Key pairs in SNN-based ViTs. Temporally, SSSA employs a saccadic interaction module that dynamically focuses on selected visual areas at each timestep and significantly enhances whole-scene understanding through temporal interactions. Building on the SSSA mechanism, we develop an SNN-based Vision Transformer (SNN-ViT). Extensive experiments across various visual tasks demonstrate that SNN-ViT achieves state-of-the-art performance with linear computational complexity. The effectiveness and efficiency of SNN-ViT highlight its potential for power-critical edge vision applications.
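The linear computational complexity claimed above can be illustrated with a generic kernelized attention pattern over binary spike tensors: by associating Key and Value first, the per-timestep cost drops from O(N²d) to O(Nd²). This is only a minimal sketch under assumed shapes; the actual SSSA spike distribution-based relevance and saccadic interaction module are not specified here, and all tensor names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T timesteps, N tokens, d channels.
T, N, d = 4, 16, 8

# Illustrative binary spike trains for Query/Key/Value
# (not the paper's actual spike encoding).
Q = (rng.random((T, N, d)) < 0.3).astype(np.float32)
K = (rng.random((T, N, d)) < 0.3).astype(np.float32)
V = (rng.random((T, N, d)) < 0.3).astype(np.float32)

def linear_spike_attention(Q, K, V):
    """Generic linear-complexity attention sketch: compute the
    (d x d) summary K^T V first, so no N x N attention map is
    ever materialized and cost scales linearly in token count N."""
    out = np.empty_like(Q)
    for t in range(Q.shape[0]):
        kv = K[t].T @ V[t]   # (d, d) Key-Value summary
        out[t] = Q[t] @ kv   # (N, d) output for this timestep
    return out

Y = linear_spike_attention(Q, K, V)
print(Y.shape)  # (4, 16, 8)
```

Note that each timestep is processed independently here; SSSA's temporal saccadic interactions, which couple timesteps, are beyond the scope of this sketch.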