Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by mimicking the event-driven processing of the brain. Incorporating Transformers into SNNs has shown promise for accuracy, yet these models struggle to capture high-frequency patterns, such as moving edges and pixel-level brightness changes, because of their reliance on global self-attention operations. Porting frequency representations into SNNs is challenging yet crucial for event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. Its critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency-domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation. We also adopt negative spike dynamics to further strengthen the frequency representation. As our empirical results show, this enables the SWformer to outperform vanilla Spiking Transformers in capturing high-frequency visual components. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free, event-driven fashion, outperforming state-of-the-art SNNs. Compared with vanilla Spiking Transformers, SWformer achieves an over-50% reduction in energy consumption, a 21.1% reduction in parameter count, and a 2.40% accuracy improvement on the ImageNet dataset.
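To make the three-branch FATM structure concrete, the sketch below shows one plausible realization in NumPy. The single-level Haar wavelet, the 3×3 depthwise kernel `k_dw`, the pointwise weight `w_pw`, the threshold `theta`, and the ternary `{-1, 0, +1}` spike function are illustrative assumptions for exposition, not the paper's exact implementation; the dense multiplications here would reduce to pure accumulations for binary spike inputs in a genuinely event-driven, multiplication-free deployment.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar transform per channel (sums/differences only,
    multiplication-free up to a constant scale). x: (C, H, W) with even H, W."""
    a = x[..., ::2, :] + x[..., 1::2, :]   # row-wise average (unnormalized)
    d = x[..., ::2, :] - x[..., 1::2, :]   # row-wise difference
    ll = a[..., :, ::2] + a[..., :, 1::2]  # low-low: coarse approximation
    lh = a[..., :, ::2] - a[..., :, 1::2]  # horizontal detail
    hl = d[..., :, ::2] + d[..., :, 1::2]  # vertical detail
    hh = d[..., :, ::2] - d[..., :, 1::2]  # diagonal detail
    return ll, lh, hl, hh

def ternary_spike(v, theta=1.0):
    """Negative spike dynamics: threshold membrane values into {-1, 0, +1}."""
    return np.where(v >= theta, 1, np.where(v <= -theta, -1, 0))

def depthwise_conv3x3(x, k):
    """Zero-padded 3x3 depthwise convolution. x: (C, H, W), k: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j, None, None] * xp[:, i:i + H, j:j + W]
    return out

def fatm(spikes, k_dw, w_pw, theta=1.0):
    """Frequency-Aware Token Mixer sketch: sum of three spiking branches.
    spikes: (C, H, W) binary spike map."""
    # 1) spiking wavelet learner: Haar subbands, nearest-neighbor upsample,
    #    then spike generation (merging subbands by summation is a toy choice)
    ll, lh, hl, hh = haar2d(spikes)
    freq = (ll + lh + hl + hh).repeat(2, axis=1).repeat(2, axis=2)
    b1 = ternary_spike(freq, theta)
    # 2) convolution-based learner: local spatial feature extraction
    b2 = ternary_spike(depthwise_conv3x3(spikes, k_dw), theta)
    # 3) spiking pointwise (1x1) convolution: cross-channel aggregation
    b3 = ternary_spike(np.einsum('oc,chw->ohw', w_pw, spikes), theta)
    return b1 + b2 + b3
```

A constant input yields zero `lh`/`hl`/`hh` subbands, matching the intuition that the high-frequency branches respond only to edges and brightness changes, while the ternary thresholding keeps every branch output event-like.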