Spiking Neural Networks (SNNs) demonstrate significant potential for energy-efficient neuromorphic computing through an event-driven paradigm. Although training methods and computational models have advanced considerably, SNNs still struggle to achieve competitive performance in visual long-sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long-sequence modeling. Inspired by this, we introduce the Spatio-Temporal Effective Receptive Field (ST-ERF) to analyze the ERF distributions across various Transformer-based SNNs. Using the proposed ST-ERF, we reveal that these models struggle to establish a robust global ST-ERF, which limits their visual feature modeling capabilities. To overcome this issue, we propose two novel channel-mixer architectures: the \underline{m}ulti-\underline{l}ayer-\underline{p}erceptron-based m\underline{ixer} (MLPixer) and the \underline{s}plash-and-\underline{r}econstruct \underline{b}lock (SRB). These architectures enhance the global spatial ERF across all timesteps in the early network stages of Transformer-based SNNs, improving performance on challenging visual long-sequence modeling tasks. Extensive experiments on the Meta-SDT variants and across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST-ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks. The code is available at \href{https://github.com/EricZhang1412/Spatial-temporal-ERF}{\faGithub~EricZhang1412/Spatial-temporal-ERF}.