Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by mimicking the event-driven processing of the brain. Integrating Transformers with SNNs has shown promise for accuracy, yet these models struggle to capture high-frequency patterns, such as moving edges and pixel-level brightness changes, due to their reliance on global self-attention operations. Porting frequency representations into SNNs is challenging yet crucial for event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. Its critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation. We also adopt negative spike dynamics to further strengthen the frequency representation. As a result, the SWformer outperforms vanilla Spiking Transformers in capturing high-frequency visual components, as evidenced by our empirical results. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free, event-driven fashion, outperforming state-of-the-art SNNs. Compared to vanilla Spiking Transformers, SWformer achieves an over 50% reduction in energy consumption, a 21.1% reduction in parameter count, and a 2.40% performance improvement on ImageNet.
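To make the frequency-decomposition idea concrete, the following is a minimal sketch of the kind of sparse wavelet operation the spiking wavelet learner builds on: a single-level 2D Haar transform, which uses only additions and subtractions (plus a constant scale) and so stays multiplication-free on binary spike maps. The function name and use of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar wavelet transform of a 2D array with even sides.

    Built from pairwise sums/differences only, so on a {0, 1} spike map the
    core computation is addition/subtraction (multiplication-free up to the
    constant 1/2 normalization). Returns four subbands at half resolution.
    """
    a = x[0::2, :] + x[1::2, :]   # row-wise low-pass (vertical average)
    d = x[0::2, :] - x[1::2, :]   # row-wise high-pass (vertical detail)
    ll = a[:, 0::2] + a[:, 1::2]  # low-low: coarse spatial structure
    lh = a[:, 0::2] - a[:, 1::2]  # low-high: horizontal changes (vertical edges)
    hl = d[:, 0::2] + d[:, 1::2]  # high-low: vertical changes (horizontal edges)
    hh = d[:, 0::2] - d[:, 1::2]  # high-high: diagonal detail
    # 1/2 normalization makes the transform orthonormal (energy-preserving)
    return ll / 2, lh / 2, hl / 2, hh / 2

# A binary spike map with a single active column (a sharp vertical edge):
# the high-frequency content lands in the lh subband, while a global
# attention average would smear it out.
spikes = np.zeros((4, 4))
spikes[:, 0] = 1.0
ll, lh, hl, hh = haar2d(spikes)
```

Because the transform is orthonormal, the subbands preserve the total energy of the input, so high-frequency components such as edges are retained rather than averaged away.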