Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by mimicking the event-driven processing of the brain. Incorporating Transformers into SNNs has shown promise for accuracy, yet such models struggle to capture high-frequency patterns, such as moving edges and pixel-level brightness changes, because of their reliance on global self-attention operations. Porting frequency representations into SNNs is challenging yet crucial for event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. Its critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation. We also adopt negative spike dynamics to further strengthen the frequency representation. As a result, SWformer outperforms vanilla Spiking Transformers in capturing high-frequency visual components, as our empirical results show. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free, event-driven fashion, outperforming state-of-the-art SNNs. Compared to vanilla Spiking Transformers, SWformer achieves an over-50% reduction in energy consumption, a 21.1% reduction in parameter count, and a 2.40% accuracy improvement on ImageNet.
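To make the frequency-domain intuition concrete, the sketch below shows a one-level 2D Haar wavelet transform of the kind a spiking wavelet learner could build on, plus a ternary spike function echoing the negative spike dynamics mentioned above. This is an illustrative sketch only, not the paper's implementation: the function names, the choice of the Haar basis, and the threshold value are assumptions.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar wavelet transform (unnormalized).

    Uses only additions and subtractions, which is why wavelet-based
    mixing is compatible with multiplication-free, event-driven compute.
    Input: 2D array with even height and width.
    Returns the four sub-bands (LL, LH, HL, HH), each half-resolution.
    """
    # Horizontal pass: pairwise sums (low-pass) and differences (high-pass).
    lo = x[:, 0::2] + x[:, 1::2]
    hi = x[:, 0::2] - x[:, 1::2]
    # Vertical pass on each half gives the four sub-bands.
    ll = lo[0::2] + lo[1::2]  # coarse approximation
    lh = lo[0::2] - lo[1::2]  # horizontal-edge detail
    hl = hi[0::2] + hi[1::2]  # vertical-edge detail
    hh = hi[0::2] - hi[1::2]  # diagonal detail
    return ll, lh, hl, hh

def ternary_spike(x, threshold=0.0):
    """Map activations to {+1, -1, 0} spikes.

    Negative spikes let the network represent the signed wavelet
    detail coefficients; the threshold here is a placeholder.
    """
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))
```

For a constant image the three detail sub-bands (LH, HL, HH) are exactly zero, while a moving edge or pixel-level brightness change shows up as nonzero, signed detail coefficients that the ternary spike function can carry downstream.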